HourVideo: 1-Hour Video-Language Understanding

Keshigeyan Chandrasegaran, Agrim Gupta, Lea M. Hadzic, Taran Kota, Jimming He, Cristobal Eyzaguirre, Zane Durante, Manling Li, Jiajun Wu, Li Fei-Fei

arXiv:2411.04998v1 [cs.CV] 7 Nov 2024

Stanford University

Abstract

We present HourVideo, a benchmark dataset for hour-long video-language understanding. Our dataset consists of a novel task suite comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. HourVideo includes 500 manually curated egocentric videos from the Ego4D dataset, spanning durations of 20 to 120 minutes, and features 12,976 high-quality, five-way multiple-choice questions. Benchmarking results reveal that multimodal models, including GPT-4 and LLaVA-NeXT, achieve marginal improvements over random chance. In stark contrast, human experts significantly outperform the state-of-the-art long-context multimodal model, Gemini Pro 1.5 (85.0% vs. 37.3%), highlighting a substantial gap in multimodal capabilities. Our benchmark, evaluation toolkit, prompts, and documentation are available at .
1 Introduction
Humans demonstrate a remarkable ability to process visual stimuli over long time horizons, enabling them to perceive, plan, and act in the real world. Consider the routine task of cooking a meal. This activity involves a continuous and adaptive visual process: identifying and using ingredients and tools, monitoring state changes of various dishes, and adjusting cooking duration/techniques based on visual cues such as color and texture. Such sustained visual processing is crucial to achieving the desired culinary outcomes. Naturally, endowing autonomous agents with this capability has been a long-standing goal in the field of Artificial Intelligence.
In recent years, large multimodal models [1–3] have emerged as a promising approach toward achieving this goal. Typically, these models are evaluated using multiple datasets that test capabilities such as object recognition [4, 5], image comprehension [6–8], and action recognition [9]. However, these benchmarks are often restricted to single images or short video clips, usually lasting from a few seconds to no more than three minutes [9–12]. While these benchmarks have spurred significant advancements, a deeper exploration into long-form video-language understanding is essential to develop multimodal systems that can form the basis for future autonomous agents and assistants.
A significant challenge in evaluating long-form video-language understanding capabilities is designing tasks that genuinely necessitate long-term comprehension, i.e., tasks that require long-range dependencies. Merely posing questions that can be answered by watching a brief segment of a lengthy video effectively reduces the task to a combination of temporal localization and short-clip understanding. Furthermore, while intriguing narrative inquiries can certainly be formulated for long-form videos such as television shows and films, it is imperative to ensure that the questions are not trivially answerable due to the vast prior knowledge encoded in modern large language models.

In this work, we introduce HourVideo, a benchmark dataset designed for long-form video-language understanding. To design tasks that require long-term comprehension, we first propose a novel task
Correspondence to {keshik,agrim}@stanford.edu
38th Conference on Neural Information Processing Systems (NeurIPS 2024) Track on Datasets and Benchmarks.
suite (Tab. 1), comprising summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. For each task, we manually create question prototypes designed to ensure that correctly answering them requires identification and synthesis of information across multiple temporal segments within the long-form videos. Guided by our task suite, we curated 500 egocentric videos from the Ego4D dataset [13], covering 77 unique everyday activities and ranging from 20 to 120 minutes, to generate questions based on our prototypes. The combination of our comprehensive task suite and everyday, mundane egocentric videos provides a robust framework to rigorously evaluate multimodal models' capabilities in understanding long-form videos. Finally, we developed a question-answer generation pipeline utilizing the expertise of trained human annotators (800+ hours of effort) and large language models (LLMs), resulting in a collection of 12,976 high-quality, five-way multiple-choice questions.

We comprehensively evaluate state-of-the-art multimodal models on HourVideo (Tab. 2, Fig. 4), including GPT-4V [2], Gemini 1.5 Pro [3], and LLaVA-NeXT [14], in a zero-shot setting. Our findings reveal that GPT-4V and LLaVA-NeXT achieve only marginal improvements over a random predictor (20%), obtaining accuracies of 25.7% and 22.3%, respectively. Gemini 1.5 Pro, designed specifically for long-context multimodal understanding, obtains an accuracy of 37.3%, which, while better, is still substantially lower than the average performance of human experts at 85.0%. These results suggest that while the multimodal community has made meaningful progress, a significant gap remains to be bridged before these systems can match human-level long-form video understanding capabilities. Progress in long-form video understanding could enable new applications, including AR assistants, embodied agents, and interactive video platforms. We hope that HourVideo will serve as a benchmark to facilitate research in this direction and enable the development of multimodal models that can understand endless streams of visual data.
2 Benchmark Design and Construction
While open-ended question answering closely emulates human interaction, automating the evaluation of free-form natural language responses remains challenging. Given that our primary goal is to assess long-form video-language understanding capabilities, we opt for a five-way multiple-choice question-answering (MCQ) task. This approach simplifies the evaluation process by allowing us to calculate an aggregate question-answering accuracy metric. In the following section, we describe our task suite and question-answer generation pipeline in detail, both of which are designed to curate diverse, high-quality five-way multiple-choice questions (MCQs).
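As a minimal illustration of the aggregate metric (the per-question record format below is an assumption for the sketch, not the benchmark's actual evaluation toolkit):

```python
def mcq_accuracy(predictions, gold):
    """Aggregate five-way MCQ accuracy: the fraction of questions whose
    predicted option letter matches the gold option letter. An unanswered
    question counts as incorrect."""
    if not gold:
        raise ValueError("no questions to score")
    correct = sum(predictions.get(qid) == ans for qid, ans in gold.items())
    return correct / len(gold)

gold = {"q1": "A", "q2": "C", "q3": "E", "q4": "B"}
preds = {"q1": "A", "q2": "B", "q3": "E"}  # q4 left unanswered
print(mcq_accuracy(preds, gold))  # -> 0.5 (2 of 4 correct)
```

Micro-averaging over all questions in this way yields the single accuracy number reported per model.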
2.1 Task Suite
Creating a comprehensive benchmark for long-form video-language understanding is challenging, primarily because formulating meaningful questions that require processing and synthesizing information across various temporal segments is highly nontrivial, even for expert human annotators. Moreover, we note that even benchmarks for image or short video clip understanding are difficult to construct. As a result, we typically observe two common strategies for benchmark creation: (1) pre-defined label spaces testing for a specific skill or within narrow domains (e.g., Kinetics [9] and Something-Something [15]); or (2) gluing together different datasets, each designed to test a specific model capability [16–19]. In contrast, a single benchmark that can comprehensively test a suite of model capabilities can significantly benefit the research community.
We draw inspiration from both lines of research methodologies and introduce a novel suite of tasks designed to benchmark long-form video-language understanding capabilities for one-hour-long videos. Our task suite encompasses a comprehensive set of perceptual and cognitive tasks, including summarization, perception (recall, tracking), visual reasoning (spatial, temporal, predictive, causal, counterfactual), and navigation (room-to-room, object retrieval) tasks. Our strategy draws on the two common approaches previously discussed: (1) designing narrowly focused question prototypes to significantly streamline the question-answer creation process, and (2) creating a diverse suite of tasks that holistically evaluate a broad spectrum of multimodal capabilities. Our task suite with manually designed question prototypes is shown in Table 1. In particular, there are 18 sub-tasks in our proposed task suite; example MCQs from HourVideo are shown in Fig. 1.
[Figure 1: Example MCQs from HourVideo for different tasks. The figure shows timestamped video frames alongside example five-way MCQs for summarization (temporal sequencing), visual reasoning (spatial, temporal, predictive, causal, counterfactual), perception (factual recall, tracking), and navigation (room-to-room, object retrieval). The correct answers are underlined.]
Summarization
- Key Events/Objects: Summarize the key interactions of the camera wearer in the [supermarket].
- Temporal Sequencing: Describe the sequence of activities performed by the camera wearer to [prepare the dessert].
- Compare/Contrast: How did the camera wearer's activities in the [apartment] differ from those in the [restaurant]?

Perception
- Information Retrieval / Factual Recall: What [dairy products] did the camera wearer [pick up] in the [supermarket]?
- Information Retrieval / Sequence Recall: What did the camera wearer do immediately after [weighing tomatoes] at the [supermarket]?
- Information Retrieval / Temporal Distance: How long after starting to [eat pizza] did the camera wearer [dispose of the pizza box]?
- Tracking: List the unique [individuals] the camera wearer interacted with at the [drugstore].

Visual Reasoning
- Spatial / Relationship: Where was the [microwave] placed in relation to the [stove] in the [kitchen]?
- Spatial / Proximity: Is the [microwave] closer to the [fridge] compared to the [sink]?
- Spatial / Layout: Which is the correct [IMAGE] depicting the layout of the camera wearer's [apartment]?
- Temporal / Duration: Which activity did the camera wearer spend more time on: [cooking] or [playing the piano]?
- Temporal / Frequency: Did the camera wearer use the [circular saw] or [crosscut saw] more frequently to [cut wood]?
- Temporal / Pre-requisites: What preparation steps did the camera wearer take before [baking cookies]?
- Predictive: What is the most likely activity the camera wearer will do next after [doing laundry]?
- Causal: Why did the camera wearer [leave the garage for the second time]?
- Counterfactual: What if the camera wearer used the [oven] to [cook mashed potatoes]?

Navigation
- Room-to-Room: How did the camera wearer get from the [building entrance] to the [apartment]?
- Object Retrieval: How can the camera wearer retrieve the [TV remote] if they are in the [kitchen]?

Table 1: Our proposed task suite with question prototypes. This table shows all 4 tasks and 18 sub-tasks proposed in HourVideo, along with the corresponding handcrafted question prototypes designed to evaluate long-form video-language understanding capabilities.
2.2 Dataset Generation Pipeline

In this section, we provide an overview of the question-answer creation pipeline that we developed to create HourVideo. The pipeline is summarized in Fig. 2.
Video curation, Stage 1. A crucial design consideration for this benchmark is the selection of video sources and types. We chose the Ego4D [13] dataset for our videos for multiple reasons: (1) its egocentric perspective aligns well with the typical visual input for autonomous agents and assistants; (2) it features extensive visual narrations, which aid in creating diverse multiple-choice questions; and (3) it is readily accessible under the Ego4D license. We manually reviewed 1,470 videos, ranging from 20 to 120 minutes, from the Ego4D dataset, assessing their potential to generate relevant questions for various tasks in our task suite. We engaged five human experts for video curation. Following this process, we curated 500 egocentric videos.
[Figure 2 diagram: five stages — 1) Video Curation (>100 hours), 2) MCQ Generation (LLM), 3) MCQ Refinement using Human Feedback (LLM, >400 hours), 4) Blind Filtering (LLM), 5) Expert MCQ Refinement (>300 hours).]

Figure 2: Our dataset generation pipeline. We develop a dataset generation pipeline consisting of five stages to create HourVideo. We leverage over 800 hours of human effort in total, corresponding to the Video Curation (Stage 1), MCQ Refinement using Human Feedback (Stage 3), and Expert MCQ Refinement (Stage 5) stages. We use LLMs for MCQ Generation (Stage 2), MCQ Refinement using Human Feedback (Stage 3), and Blind Filtering (Stage 4). We note that causal, counterfactual, and navigation questions are manually generated by human experts (see Sec. 2.2 for details).
Candidate MCQ Generation, Stage 2. The objective of this stage is to produce high-quality MCQs for each task, requiring analysis and synthesis of information across multiple temporal segments in a long-form video. Initially, we manually develop question template(s) for each task in the suite. As shown in Table 1, transforming a question template into an actual question involves incorporating video-specific information tailored to the task and template. To facilitate this, we utilize the detailed narrations from the Ego4D dataset, transforming them into a structured format that can be processed by an LLM. Specifically, we segment the video at 20-minute intervals, with each segment's representation including a summary and a list of tools, food items, technology, humans, pets, and physical locations encountered by the camera wearer in the video. We note that synthesizing a structured representation and a question template into a valid question with correct and incorrect answers presents a significant challenge, even for advanced LLMs. Consequently, for each task, we formulate detailed prompts that offer question prototypes, comprehensive instructions, in-context examples, and step-by-step guidance on how to transform a question template into a valid candidate MCQ (MCQ2). In total, we developed 25 task-specific prompts.
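The 20-minute segmentation step described above can be sketched as follows. This is a rough illustration under assumed inputs (timestamped narration lines); the record fields are ours, not the paper's code:

```python
from collections import defaultdict

SEGMENT_SECONDS = 20 * 60  # the paper segments each video at 20-minute intervals

def build_segments(narrations):
    """Group (timestamp_sec, text) narration lines into 20-minute segments.

    Each segment record also carries placeholder fields (summary, tools,
    food items, etc.) that the pipeline would fill in with an LLM pass.
    """
    buckets = defaultdict(list)
    for ts, text in narrations:
        buckets[int(ts) // SEGMENT_SECONDS].append(text)
    segments = []
    for idx in sorted(buckets):
        segments.append({
            "start_sec": idx * SEGMENT_SECONDS,
            "end_sec": (idx + 1) * SEGMENT_SECONDS,
            "narrations": buckets[idx],
            # Filled by an LLM from the narrations in the real pipeline:
            "summary": None,
            "tools": [], "food_items": [], "technology": [],
            "humans": [], "pets": [], "physical_locations": [],
        })
    return segments

narr = [(30, "C opens the fridge"), (1250, "C peels potatoes"), (1500, "C mashes potatoes")]
print([s["narrations"] for s in build_segments(narr)])
# -> [['C opens the fridge'], ['C peels potatoes', 'C mashes potatoes']]
```

The structured per-segment records, rather than the raw narration stream, are what get interpolated into the task-specific prompts.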
MCQ Refinement with LLMs using Human Feedback, Stage 3. The purpose of this phase is to refine MCQ2, created in the previous stage. MCQ2 may contain invalid questions, incorrect answers, trivial incorrect options, and various other issues. We identified that a significant source of these issues stemmed from relying on the noisy narrations in Ego4D. For example, different narrators within the same video could refer to a dishwasher as a "plate rack" or use other terms, and an individual might be described as an "adult," "person with a red and white shirt," "man Y," or "teenager" at various times in the narration. These inconsistencies, combined with our automatic question generation in the first stage, could lead to the generation of invalid MCQs. To address noisy MCQs, we implement a human feedback system where trained annotators are tasked with: 1) assessing the validity of each question to ensure it aligns with the video content; 2) verifying the accuracy of the given answer and, if it is found incorrect, providing the correct answer in free-form text; and 3) ensuring that all incorrect options are factually wrong and clearly distinguishable from the correct answer. We gather human feedback for all MCQ2, involving over 400 hours of human effort. We then design prompts to automatically refine MCQ2 using this human feedback, producing MCQ3. We engaged seven trained annotators in this stage.
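The three annotator checks above can be captured in a small feedback record. This is only a sketch of one plausible shape (the field names are assumptions, not the paper's annotation schema):

```python
from dataclasses import dataclass, field

@dataclass
class MCQFeedback:
    """Annotator feedback on one candidate MCQ (Stage 3)."""
    question_is_valid: bool                 # 1) question aligns with the video content
    answer_is_correct: bool                 # 2) the marked answer is accurate
    corrected_answer: str = ""              #    free-form correction when it is not
    flawed_options: list = field(default_factory=list)  # 3) distractors not clearly wrong

    def needs_refinement(self):
        # Any failed check routes the MCQ through an LLM refinement prompt.
        return (not self.question_is_valid
                or not self.answer_is_correct
                or bool(self.flawed_options))

fb = MCQFeedback(question_is_valid=True, answer_is_correct=False,
                 corrected_answer="The camera wearer used a dishwasher, not a plate rack.")
print(fb.needs_refinement())  # -> True
```

Feedback records that trip any check would be fed, together with the original MCQ, into the refinement prompt that produces MCQ3.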
Blind filtering, Stage 4. Modern LLMs possess extensive prior knowledge and can thus easily answer certain questions without needing to analyze the videos. The objective of this phase is to eliminate questions that can be answered through prior knowledge or that can be trivially answered without requiring any information from the video. To address this, we perform blind filtering of MCQ3, utilizing two separate blind LLMs (GPT-4-turbo and GPT-4). Specifically, we exclude any MCQ that is correctly answered by at least one LLM without video input. Although this method may aggressively remove MCQs, it ensures that the remaining MCQ4 are of high quality and specifically tailored to test long-form video-language understanding.
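The exclusion rule reduces to a simple filter. A sketch, where `blind_models` stands in for hypothetical wrappers that return each blind LLM's chosen option letter given only the question text (no video):

```python
def blind_filter(mcqs, blind_models):
    """Drop any MCQ that at least one blind LLM answers correctly (Stage 4).

    mcqs:         dicts with 'question', 'options', and gold 'answer' keys
    blind_models: callables (question, options) -> option letter, queried
                  without any video input
    """
    kept = []
    for mcq in mcqs:
        answered_blind = any(model(mcq["question"], mcq["options"]) == mcq["answer"]
                             for model in blind_models)
        if not answered_blind:
            kept.append(mcq)
    return kept

# Toy blind "models": one always guesses A, one always guesses C.
always_a = lambda q, opts: "A"
always_c = lambda q, opts: "C"
mcqs = [{"question": "q1", "options": "ABCDE", "answer": "B"},
        {"question": "q2", "options": "ABCDE", "answer": "C"}]
print([m["question"] for m in blind_filter(mcqs, [always_a, always_c])])
# -> ['q1']  (q2 is excluded: a blind model answered it correctly)
```

Using `any` over the two blind models implements the paper's "correctly answered by at least one LLM" criterion.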
[Figure 3 panels. ? Top-20 scenarios: cooking, cleaning/laundry, construction/renovation, eating, crafting/knitting, carpenter, talking with family members, indoor navigation (walking), watching TV, on a screen (phone/laptop), listening to music, grocery shopping indoors, playing with pets, walking on street, gardening, baker, doing yard work, car commuting/road trip, bike mechanic, and working at desk. ? MCQ counts per task/sub-task: Summarization (714): Key Events/Objects Identification 467, Temporal Sequencing 152, Compare/Contrast 95; Perception (3,777): Factual Recall 2,479, Sequence Recall 854, Temporal Distance 267, Tracking 177; Visual Reasoning: Spatial (3,173): Relationship 1,889, Proximity 1,239, Layout 45; Temporal (4,292): Duration 1,945, Frequency 1,815, Pre-requisites 532; Predictive (407); Causal (150); Counterfactual (151); Navigation (312): Room-to-Room, Object Retrieval. ? Distribution of video durations (20–120 minutes). ? Distribution of the number of MCQs per video.]

Figure 3: Dataset Statistics. ?: HourVideo includes 500 videos sourced from the Ego4D dataset, spanning 77 everyday scenarios. The bar chart shows the top 20 scenarios. ?: We report the number of MCQs per task/sub-task. In total, there are 12,976 questions in HourVideo. ?: We show the distribution of video duration in HourVideo. The average duration of videos in HourVideo is 45.7 minutes, with 113 videos extending beyond one hour. ?: We show the distribution of the number of MCQs per video. On average, each video contains 26 MCQs.
Expert Refinement, Stage 5. The aim of this stage is to enhance the quality of MCQ4 by utilizing a selected group of expert human annotators. This stage serves as a comprehensive step to address various remaining issues that might have persisted through prior stages. Examples of expert refinement include transforming a broad question like "Where did the camera wearer leave the keys?" into a more precise query: "Where did the camera wearer leave the bike keys after returning home from shopping?" Over 300 hours of expert human effort are employed in this stage to carefully examine and refine MCQ4, culminating in high-quality MCQ5. We engaged four human experts in this stage.
Manual Generation. Despite our extensive efforts to fully or partially automate question generation, we discovered that certain tasks did not align well with the pipeline described earlier. Specifically, for causal, counterfactual, spatial layout, and navigation tasks, we found it more effective to generate questions manually with human experts rather than processing them through our multi-stage pipeline. Consequently, for these tasks in our benchmark, we generated high-quality questions, albeit in a smaller quantity. Four human experts were engaged in this stage, generating a total of 658 MCQs (5.1%).
Implementation details. We used GPT-4 in our pipeline as it offers impressive capabilities to follow complex multi-step instructions. We used the Chain-of-Thought [20] prompting strategy and a temperature of 0.1 for all stages involving LLMs in our pipeline. We show an example MCQ life-cycle in Fig. B.2. See Supplementary B for more details on dataset generation. We include the exact prompts used for generating MCQ2 for the following tasks: narration compilation (Fig. E.1), summarization (Fig. E.2, E.3), perception/information retrieval/factual recall (Fig. E.4, E.5), and visual reasoning/spatial/relationship (Fig. E.6, E.7).
2.3 HourVideo Statistics

HourVideo consists of 500 videos from the Ego4D dataset, covering 77 daily life scenarios such as cooking, cleaning, eating, watching TV, baking, etc. (Fig. 3). The dataset includes 381 hours of video footage, with video durations ranging from 20 to 120 minutes (Fig. 3). On average, each video is approximately 45.7 minutes long, with 113 videos extending beyond one hour.