arXiv:2207.04429v1 [cs.RO] 10 Jul 2022

LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action

Dhruv Shah (+, β), Błażej Osiński (+, β, ω), Brian Ichter (γ), Sergey Levine (β, γ)
β UC Berkeley, ω University of Warsaw, γ Robotics at Google

Abstract: Goal-conditioned policies for robotic navigation can be trained on large, unannotated datasets, providing for good generalization to real-world settings. However, particularly in vision-based settings where specifying goals requires an image, this makes for an unnatural interface. Language provides a more convenient modality for communication with robots, but contemporary methods typically require expensive supervision, in the form of trajectories annotated with language descriptions. We present a system, LM-Nav, for robotic navigation that enjoys the benefits of training on unannotated large datasets of trajectories, while still providing a high-level interface to the user. Instead of utilizing a labeled instruction following dataset, we show that such a system can be constructed entirely out of pre-trained models for navigation (ViNG), image-language association (CLIP), and language modeling (GPT-3), without requiring any fine-tuning or language-annotated robot data. We instantiate LM-Nav on a real-world mobile robot and demonstrate long-horizon navigation through complex, outdoor environments from natural language instructions.

Keywords: instruction following, language models, vision-based navigation

1 Introduction

One of the central challenges in robotic learning is to enable robots to perform a wide variety of tasks on command, following high-level instructions from humans. This requires robots that can understand human instructions, and are equipped with a large repertoire of diverse behaviors to execute such instructions in the real world. Prior work on instruction following in navigation has largely focused on learning from trajectories annotated with textual instructions [1-5]. This enables understanding of textual instructions, but the cost of data annotation impedes wide adoption. On the other hand, recent work has shown that learning robust navigation is possible through goal-conditioned policies trained with self-supervision. These utilize large, unlabeled datasets to train vision-based controllers via hindsight relabeling [6-11]. They provide scalability, generalizability, and robustness, but usually involve a clunky mechanism for goal specification, using locations or images. In this work, we aim to combine the strengths of both approaches, enabling a self-supervised system for robotic navigation to execute natural language instructions by leveraging the capabilities of pre-trained models without any user-annotated navigational data. Our method uses these models to construct an “interface” that humans can use to communicate desired tasks to robots. This system enjoys the impressive generalization capabilities of the pre-trained language and vision-language models, enabling the robotic system to accept complex high-level instructions.

Our main observation is that we can utilize off-the-shelf pre-trained models trained on large corpora of visual and language datasets, which are widely available and show great few-shot generalization capabilities, to create this interface for embodied instruction following. To achieve this, we combine the strengths of two such robot-agnostic pre-trained models with a pre-trained navigation model. We use a visual navigation model (VNM: ViNG [11]) to create a topological “mental map” of the environment using the robot’s observations. Given free-form textual instructions, we use a pre-trained large language model (LLM: GPT-3 [12]) to decode the instructions into a sequence of textual landmarks. We then use a vision-language model (VLM: CLIP [13]) for grounding these textual landmarks in the topological map, by inferring a joint likelihood over the landmarks and nodes. A novel search algorithm is then used to maximize a probabilistic objective and find a plan for the robot, which is then executed by VNM.

Figure 1: Embodied instruction following with LM-Nav: Our system takes as input a set of raw observations from the target environment and free-form textual instructions (left), deriving an actionable plan using three pre-trained models: a large language model (LLM) for extracting landmarks, a vision-and-language model (VLM) for grounding, and a visual navigation model (VNM) for execution. This enables LM-Nav to follow textual instructions in complex environments purely from visual observations (right) without any fine-tuning.

+ These authors contributed equally, order decided by a coin flip. Check out the project page for experiment videos, code, and a user-friendly Colab notebook that runs in your browser: /view/lmnav

Our primary contribution is Large Model Navigation, or LM-Nav, an embodied instruction following system that combines three large independently pre-trained models to enable long-horizon instruction following in complex, real-world environments: a self-supervised robotic control model that utilizes visual observations and physical actions (VNM), a vision-language model that grounds images in text but has no context of embodiment (VLM), and a large language model that can parse and translate text but has no sense of visual grounding or embodiment (LLM). We present the first instantiation of a robotic system that combines pre-trained vision-and-language models with a goal-conditioned controller to derive actionable plans without any fine-tuning in the target environment. Notably, all three models are trained on large-scale datasets, with self-supervised objectives, and used off-the-shelf with no fine-tuning; no human annotations of the robot navigation data are necessary to train LM-Nav. We show that LM-Nav is able to successfully follow natural language instructions in new environments over the course of 100s of meters of complex, suburban navigation, while disambiguating paths with fine-grained commands.

2 Related Work

Early works in augmenting navigation policies with natural language commands use statistical machine translation [14] to discover data-driven patterns to map free-form commands to a formal language defined by a grammar [15-19]. However, these approaches tend to operate on structured state spaces. Our work is closely inspired by methods that instead reduce this task to a sequence prediction problem [1, 20, 21]. Notably, our goal is similar to the task of vision-and-language navigation (VLN): leveraging fine-grained instructions to control a mobile robot solely from visual observations [1, 2].

However, most recent approaches to VLN use a large dataset of simulated trajectories (over 1M demonstrations) annotated with fine-grained language labels in indoor [1, 3-5, 22] and driving scenarios [23-28], and rely on sim-to-real transfer for deployment in simple indoor environments [29, 30]. This approach necessitates building a photo-realistic simulator resembling the target environment, which can be challenging for unstructured environments, especially for the task of outdoor navigation. Instead, LM-Nav leverages free-form textual instructions to navigate a robot in complex, outdoor environments without access to any simulation or any trajectory-level annotations.

Recent progress in using large-scale models of natural language and images trained on diverse data has enabled applications in a wide variety of textual [31-33], visual [13, 34-38], and embodied domains [39-44]. In the latter category, Shridhar et al. [39], Khandelwal et al. [44] and Jang et al. [40] fine-tune embeddings from pre-trained models on robot data with language labels, Huang et al. [41] assume that the low-level agent can execute textual instructions (without addressing control), and Ahn et al. [42] assume that the robot has a set of text-conditioned skills that can follow atomic textual commands. All of these approaches require access to low-level skills that can follow rudimentary textual commands, which in turn requires language annotations for robotic experience and a strong assumption on the robot’s capabilities. In contrast, we combine these pre-trained vision and language models with pre-trained visual policies that do not use any language annotations [11, 45], without fine-tuning these models in the target environment or for the task of VLN.

Data-driven approaches to vision-based mobile robot navigation often use photorealistic simulators [46-49] or supervised data collection [50] to learn goal-reaching policies directly from raw observations. Self-supervised methods for navigation [6-11, 51] instead can use unlabeled datasets of trajectories by automatically generating labels using onboard sensors and hindsight relabeling. Notably, such a policy can be trained on large, diverse datasets and generalize to previously unseen environments [45, 52]. Being self-supervised, such policies are adept at navigating to desired goals specified by GPS locations or images, but are unable to parse high-level instructions such as free-form text. LM-Nav uses self-supervised policies trained in a large number of prior environments, augmented with pre-trained vision and language models for parsing natural language instructions, and deploys them in novel real-world environments without any fine-tuning.

3 Preliminaries

LM-Nav consists of three large, pre-trained models for processing language, associating images with language, and visual navigation.

Large language models are generative models based on the Transformer architecture [53], trained on large corpora of internet text. LM-Nav uses the GPT-3 LLM [12] to parse textual instructions into a sequence of landmarks.

Figure 2: LM-Nav uses VLM to infer a joint probability distribution over textual landmarks and image observations. VNM constitutes an image-conditioned distance function and policy that can control the robot. Panels: (a) CLIP VLM, pairing a text encoder (example caption: “a photo of a stop sign”) with a ViT-L image encoder; (b) ViNG VNM, mapping the current observation and a commanded subgoal to a distance and actions.

Vision-and-language models refer to models that can associate images and text, e.g. image captioning, visual question-answering, etc. [54-56]. We use the CLIP VLM [13], a model that jointly encodes images and text into an embedding space that allows it to determine how likely some string is to be associated with a given image. We can jointly encode a set of landmark descriptions $t$ obtained from the LLM and a set of images $i_k$ to obtain their VLM embeddings $\{T, I_k\}$ (see Fig. 3). Computing the cosine similarity between these embeddings, followed by a softmax operation, results in probabilities $P(i_k \mid t)$, corresponding to the likelihood that image $i_k$ corresponds to the string $t$. LM-Nav uses this probability to align landmark descriptions with images.
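The cosine-similarity-then-softmax step can be sketched in a few lines. This is only an illustration under stated assumptions: the 3-dimensional embeddings are toy values rather than real CLIP outputs, and the `temperature` parameter is a hypothetical scaling knob, not a value taken from the paper.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def image_probabilities(text_embedding, image_embeddings, temperature=1.0):
    """Softmax over cosine similarities: returns P(i_k | t) for each image k.

    `temperature` is an assumed scaling parameter for illustration only.
    """
    logits = [cosine_similarity(text_embedding, img) / temperature
              for img in image_embeddings]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy embeddings (hypothetical; real CLIP embeddings are much larger).
text = [1.0, 0.0, 0.0]
images = [[0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.5, 0.5, 0.0]]
probs = image_probabilities(text, images, temperature=0.1)
```

The image whose embedding points most nearly in the same direction as the text embedding receives the largest probability, which is the property used to align landmark descriptions with images.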

Visual navigation models learn navigation behavior and navigational affordances directly from visual observations [11, 51, 57-59], associating images and actions through time. We use the ViNG VNM [11], a goal-conditioned model that predicts temporal distances between pairs of images and the corresponding actions to execute (see Fig. 3). This provides an interface between images and embodiment. The VNM serves two purposes: (i) given a set of observations in the target environment, the distance predictions from the VNM can be used to construct a topological graph $\mathcal{G}(V, E)$ that represents a “mental map” of the environment; (ii) given a “walk”, comprising a sequence of connected subgoals to a goal node, the VNM can navigate the robot along this plan. The topological graph $\mathcal{G}$ is an important abstraction that allows a simple interface for planning over past experience in the environment and has been successfully used in prior work to perform long-horizon navigation [52, 60, 61]. To deduce connectivity in $\mathcal{G}$, we use a combination of learned distance estimates, temporal proximity (during data collection), and spatial proximity (using GPS measurements). For every connected pair of vertices $\{v_i, v_j\}$, we assign the learned distance estimate as the corresponding edge weight $D(v_i, v_j)$. For more details on the construction of this graph, see Appendix B.
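As a rough illustration of how distance estimates can induce a topological graph, the sketch below connects two observations whenever a distance predictor returns a value under a threshold. The `predict_distance` callable, the `max_distance` threshold, and the 1-D toy observations are all assumptions made for illustration; the system described here additionally uses temporal and GPS proximity, which this sketch omits.

```python
def build_topological_graph(observations, predict_distance, max_distance):
    """Connect pairs of observations whose estimated distance is small.

    Returns an adjacency dict: node index -> {neighbor index: edge weight D}.
    """
    graph = {i: {} for i in range(len(observations))}
    for i in range(len(observations)):
        for j in range(len(observations)):
            if i == j:
                continue
            d = predict_distance(observations[i], observations[j])
            if d <= max_distance:
                graph[i][j] = d  # edge weight is the distance estimate
    return graph

# Stand-in for the VNM: observations are 1-D positions, distance is their gap.
positions = [0.0, 1.0, 2.5, 6.0]
graph = build_topological_graph(
    positions, lambda a, b: abs(a - b), max_distance=2.0
)
```

In this toy setting the last observation ends up isolated because no other observation lies within the threshold, which hints at why a real system benefits from additional connectivity cues.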

Figure 3: System overview: (a) VNM uses a goal-conditioned distance function to infer connectivity between the set of raw observations and constructs a topological graph. (b) LLM translates natural language instructions into a sequence of textual landmarks. (c) VLM infers a joint probability distribution over the landmark descriptions and nodes in the graph, which is used by (d) a graph search algorithm to derive the optimal walk through the graph. (e) The robot drives following the walk in the real world using the VNM policy.

4 LM-Nav: Instruction Following with Pre-Trained Models

LM-Nav combines the components discussed earlier to follow textual instructions in the real world. The LLM parses free-form instructions into a list of landmarks $\bar{l}$ (Sec. 4.2), the VLM associates these landmarks with nodes in the graph by estimating the probability that each node corresponds to each landmark, giving $P_l(\bar{v} \mid \bar{l})$ (Sec. 4.3), and the VNM is then used to infer how effectively the robot can navigate between each pair of nodes in the graph, which we convert into a probability $P(v_i, v_j)$ derived from the estimated temporal distances. To find the optimal “walk” on the graph that both (i) adheres to the provided instructions and (ii) minimizes traversal cost, we derive a probabilistic objective (Sec. 4.1) and show how it can be optimized using a graph search algorithm (Sec. 4.4). This optimal walk is then executed in the real world by using the actions produced by the VNM model.

4.1 Problem Formulation

Algorithm 1: Graph Search
    Input: Landmarks (l_1, l_2, ..., l_n).
    Input: Graph G(V, E).
    Input: Starting node S.
    for all i = 0, ..., n and all v in V: Q[i, v] = -infinity
    Q[0, S] = 0
    Dijkstra algorithm(G, Q[0, *])
    for i in 1, 2, ..., n do
        for all v in V: Q[i, v] = Q[i - 1, v] + CLIP(l_i, v)
        Dijkstra algorithm(G, Q[i, *])
    end for
    destination = argmax(Q[n, *])
    return backtrack(destination, Q[n, *])

We formulate the task of instruction following on the graph as that of maximizing the probability of successfully executing a walk that matches the instruction. As we will discuss in Section 4.2, we first parse the instruction into a list of landmarks $\bar{l} = l_1, l_2, \ldots, l_n$ that should be visited in order. Recall that the VNM is used to build a topological graph that represents the connectivity of the environment from previously seen observations, with nodes $\{v_i\}$ corresponding to previously seen images. For a walk $\bar{v} = v_1, v_2, \ldots, v_T$, we factorize the probability that it corresponds to the given instruction into: (i) $P_l$, the probability that the walk visits all landmarks from the description; (ii) $P_t$, the probability that the walk can be executed successfully. Let $P(l_i \mid v_j)$ denote the probability that node $v_j$ corresponds to the landmark description $l_i$. Then we have:

$$P_l(\bar{v} \mid \bar{l}) = \max_{1 \le t_1 \le t_2 \le \ldots \le t_n \le T} \prod_{1 \le k \le n} P(l_k \mid v_{t_k}), \tag{1}$$

where $t_1, t_2, \ldots, t_n$ is an assignment of a subsequence of the walk’s nodes to landmark descriptions.
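For one fixed walk, the maximization over ordered assignments in Eq. 1 does not require enumerating every subsequence. The following is a minimal sketch, assuming the per-node probabilities are already available as a table `prob[k][t]` (a hypothetical input layout chosen for this example); it sweeps the walk once per landmark.

```python
def landmark_probability(prob):
    """prob[k][t] = P(l_k | v_t) for landmark k and walk position t.

    Returns the max over non-decreasing assignments t_1 <= ... <= t_n of
    the product of prob[k][t_k], i.e. Eq. 1 evaluated for one fixed walk.
    """
    num_positions = len(prob[0])
    # best[t]: best product for the landmarks handled so far, using an
    # assignment whose last index is <= t (1.0 when no landmark is placed).
    best = [1.0] * num_positions
    for row in prob:
        new_best = []
        running = 0.0
        for t in range(num_positions):
            # Place the current landmark at position t, or keep an earlier best.
            running = max(running, best[t] * row[t])
            new_best.append(running)
        best = new_best
    return best[-1]

# Landmarks appear in order along the walk: the good assignment is allowed.
in_order = landmark_probability([[0.8, 0.1, 0.1], [0.6, 0.2, 0.9]])
# Landmarks appear in reverse order: the ordering constraint lowers the score.
reversed_order = landmark_probability([[0.1, 0.1, 0.8], [0.9, 0.2, 0.1]])
```

The second call shows why the ordering constraint matters: each landmark has a high-probability node, but they occur in the wrong order along the walk, so the best admissible assignment scores much lower.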

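To obtain the probability $P_t(\bar{v})$, we must convert the distance estimates provided by the VNM model into probabilities. This has been studied in the literature on goal-conditioned policies [62, 63]. A simple model based on a discounted MDP formulation is to model the probability of successfully reaching the goal as $\gamma$ to the power of the number of time steps, which corresponds to a probability of termination of $1 - \gamma$ at each time step. We then have

$$P_t(\bar{v}) = \prod_{1 \le j < T} P(v_j, v_{j+1}) = \prod_{1 \le j < T} \gamma^{D(v_j, v_{j+1})}, \tag{2}$$

where $D(v_j, v_{j+1})$ refers to the length (in the number of time steps) of the edge between nodes $v_j$ and $v_{j+1}$, which is provided by the VNM model. The final probabilistic objective that our system needs to maximize becomes:

$$P_M(\bar{v}) = P_t(\bar{v}) \, P_l(\bar{v} \mid \bar{l}) = \prod_{1 \le j < T} \gamma^{D(v_j, v_{j+1})} \max_{1 \le t_1 \le \ldots \le t_n \le T} \prod_{1 \le k \le n} P(l_k \mid v_{t_k}). \tag{3}$$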
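A small numeric sketch of Eq. 2 may help. The discount `gamma = 0.9` and the edge lengths below are made-up values for illustration, not numbers from the paper. The second function shows the same quantity in log space, where the product over edges becomes a sum of edge lengths scaled by `-log(gamma)`.

```python
import math

def traversal_probability(edge_lengths, gamma):
    """Eq. 2: product over the walk's edges of gamma ** D(v_j, v_{j+1})."""
    return math.prod(gamma ** d for d in edge_lengths)

def log_traversal_probability(edge_lengths, gamma):
    """Same quantity in log space: -alpha * sum of D, with alpha = -log(gamma)."""
    alpha = -math.log(gamma)
    return -alpha * sum(edge_lengths)

# Hypothetical walk with three edges of 3, 5, and 2 time steps.
lengths = [3, 5, 2]
p_t = traversal_probability(lengths, gamma=0.9)
log_p_t = log_traversal_probability(lengths, gamma=0.9)
```

Working in log space is convenient because longer walks are penalized additively, which fits shortest-path style search.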

4.2 Parsing Free-Form Textual Instructions

The user specifies the route they want the robot to take using natural language, while the objective above is defined in terms of a sequence of desired landmarks. To extract this sequence from the user’s natural language instruction we employ a standard large language model, which in our prototype is GPT-3 [12]. We used a prompt with 3 examples of correct landmark extractions, followed by the description to be translated by the LLM. This approach worked for the instructions that we tested it on. Examples of instructions together with landmarks extracted by the model can be found in Fig. 4. The appropriate selection of the prompt, including those 3 examples, was required for more nuanced cases. For details of the “prompt engineering” please see Appendix A.
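The few-shot prompting pattern described above can be sketched as plain string assembly. Everything concrete here is an assumption for illustration: the three example instructions, the `Instruction:` and `Landmarks:` labels, and the comma-separated output format are hypothetical and are not the prompt from Appendix A. No language-model call is made; `parse_landmarks` only shows how a completion in this assumed format could be turned into an ordered list.

```python
def build_prompt(examples, instruction):
    """Assemble a few-shot prompt: worked examples first, then the new query."""
    parts = []
    for text, landmarks in examples:
        parts.append(f"Instruction: {text}\nLandmarks: {', '.join(landmarks)}")
    parts.append(f"Instruction: {instruction}\nLandmarks:")
    return "\n\n".join(parts)

def parse_landmarks(completion):
    """Split a comma-separated completion into an ordered list of landmarks."""
    return [item.strip() for item in completion.split(",") if item.strip()]

# Hypothetical few-shot examples (not taken from the paper's appendix).
examples = [
    ("Go past the fountain and stop at the bench.", ["a fountain", "a bench"]),
    ("Turn at the red car, then find the mailbox.", ["a red car", "a mailbox"]),
    ("Head to the tree and then to the gate.", ["a tree", "a gate"]),
]
prompt = build_prompt(
    examples,
    "Go straight toward the white building. Continue straight passing by "
    "a white truck until you reach a stop sign.",
)
landmarks = parse_landmarks(" a white building, a white truck, a stop sign ")
```

Keeping the output format rigid makes the model's completion easy to parse into the ordered landmark list that the rest of the pipeline expects.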

4.3 Visually Grounding Landmark Descriptions

As discussed in Sec. 4.1, a crucial element of selecting the walk through the graph is computing $P(l_i \mid v_j)$, the probability that landmark description $l_i$ refers to node $v_j$ (see Equation 1). With each node containing an image taken during initial data collection, the probability can be computed using CLIP [13] in the way described in Sec. 3, as a retrieval task. As presented in Fig. 2, to employ CLIP to compute $P(l_i \mid v_j)$, we use the image at node $v_j$ and caption prompts in the form of “This is a photo of a [$l_i$]”. The resulting probability $P(l_i \mid v_j)$, together with the inferred edge distances, will be used to select the optimal walk in the graph.

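4.4 Graph Search for the Optimal Walk

As described in Sec. 4.1, LM-Nav aims at finding a walk $\bar{v} = (v_1, v_2, \ldots, v_T)$ that maximizes the probability of successful execution that adheres to the given instructions. We formalized this probability as $P_M$, defined by Eqn. 3. We can define a function $R(\bar{v}, \bar{t})$ for a monotonically increasing sequence of indices $\bar{t} = (t_1, t_2, \ldots, t_n)$:

$$R(\bar{v}, \bar{t}) := \sum_{i=1}^{n} \log P(l_i \mid v_{t_i}) - \alpha \sum_{j=1}^{T-1} D(v_j, v_{j+1}), \quad \text{where } \alpha = -\log \gamma, \tag{4}$$

which has the property that $\bar{v}$ maximizes $P_M$ if and only if there exists $\bar{t}$ such that $\bar{v}, \bar{t}$ maximizes $R$. In order to find such $\bar{v}, \bar{t}$, we employ dynamic programming. In particular, we define a helper function $Q(i, v)$ for $i \in \{0, 1, \ldots, n\}$, $v \in V$:

$$Q(i, v) = \max_{\substack{\bar{v} = (v_1, v_2, \ldots, v_j),\; v_j = v \\ \bar{t} = (t_1, t_2, \ldots, t_i)}} R(\bar{v}, \bar{t}). \tag{5}$$

$Q(i, v)$ represents the maximal value of $R$ for a walk ending in $v$ that visited the landmarks up to index $i$. The base case $Q(0, v)$ visits none of the landmarks, and its value of $R$ is simply the negated, $\alpha$-scaled length of the shortest path from node $S$. For $i > 0$ we have:

$$Q(i, v) = \max\left( Q(i-1, v) + \log P(l_i \mid v),\; \max_{w \in \mathrm{neighbors}(v)} Q(i, w) - \alpha \cdot D(v, w) \right). \tag{6}$$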

Figure 4: Qualitative examples of LM-Nav in real-world environments executing textual instructions (left). The landmarks extracted by LLM (highlighted in text) are grounded into visual observations by VLM (center; overhead image not available to the robot). The resulting walk of the graph is executed by VNM (right).

The base case for DP is to compute $Q(0, V)$. Then, in each step of DP $i = 1, 2, \ldots, n$ we compute $Q(i, v)$. This computation resembles the Dijkstra algorithm [64]. In each iteration, we pick the node $v$ with the largest value of $Q(i, v)$ and update its neighbors based on Eqn. 6. Algorithm 1 summarizes this search process. The result of this algorithm is a walk $\bar{v} = (v_1, v_2, \ldots, v_T)$ that maximizes the probability of successfully carrying out the instruction. Given such a walk, VNM can execute the path by using its action estimates to sequentially navigate to these nodes.
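The layered search can be sketched as follows. This is one possible reading of Algorithm 1 and Eqn. 6, not the authors' implementation: the adjacency-dict graph format, the `log_prob` table layout, the parent-pointer backtracking, and the four-node example are all assumptions made for illustration.

```python
import heapq
import math

def relax_layer(graph, q, parent, layer, alpha):
    """Dijkstra-style pass that maximizes q[v]; each edge costs alpha * D(v, w)."""
    heap = [(-value, v) for v, value in q.items() if value > -math.inf]
    heapq.heapify(heap)
    while heap:
        neg_value, v = heapq.heappop(heap)
        if -neg_value < q[v]:
            continue  # stale entry: q[v] was improved after this was pushed
        for w, dist in graph[v].items():
            candidate = q[v] - alpha * dist
            if candidate > q[w]:
                q[w] = candidate
                parent[(layer, w)] = (layer, v)
                heapq.heappush(heap, (-candidate, w))

def graph_search(graph, log_prob, start, alpha):
    """graph: node -> {neighbor: D}; log_prob[i][v] = log P(l_{i+1} | v).

    Returns (walk, score), processing one landmark per layer.
    """
    n = len(log_prob)
    parent = {}
    q = {v: -math.inf for v in graph}
    q[start] = 0.0
    relax_layer(graph, q, parent, 0, alpha)
    for i in range(1, n + 1):
        prev = q
        q = {v: prev[v] + log_prob[i - 1][v] for v in graph}
        for v in graph:
            parent[(i, v)] = (i - 1, v)  # landmark i is assigned to node v
        relax_layer(graph, q, parent, i, alpha)
    destination = max(q, key=q.get)
    # Backtrack through (layer, node) states to recover the walk.
    state, walk = (n, destination), [destination]
    while state in parent:
        state = parent[state]
        if state[1] != walk[-1]:
            walk.append(state[1])
    return walk[::-1], q[destination]

# Hypothetical map: two equally long routes from node 0 to node 3.
graph = {0: {1: 1.0, 2: 1.0}, 1: {0: 1.0, 3: 1.0},
         2: {0: 1.0, 3: 1.0}, 3: {1: 1.0, 2: 1.0}}
low, high = math.log(0.05), math.log(0.85)
log_prob = [{0: low, 1: low, 2: high, 3: low},   # landmark 1 looks like node 2
            {0: low, 1: low, 2: low, 3: high}]   # landmark 2 looks like node 3
walk, score = graph_search(graph, log_prob, start=0, alpha=0.1)
```

In this toy map both routes to the final node have the same length, and the search picks the one through node 2 because that is where the first landmark is most likely to be.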

5 System Evaluation

We now describe our experiments deploying LM-Nav in a variety of outdoor settings to follow high-level natural language instructions with a small ground robot. For all experiments, the weights of LLM, VLM, and VNM are frozen; there is no fine-tuning or annotation in the target environment. We evaluate the complete system, as well as the individual components of LM-Nav, to understand its strengths and limitations. Our experiments demonstrate the ability of LM-Nav to follow high-level instructions, disambiguate paths, and reach goals that are up to 800m away.

5.1 Mobile Robot Platform

We implement LM-Nav on a Clearpath Jackal UGV platform (see Fig. 1 (right)). The sensor suite consists of a 6-DoF IMU, a GPS unit for approximate localization, wheel encoders for local odometry, and front- and rear-facing RGB cameras with a 170° field-of-view for capturing visual observations and localization in the topological graph. The LLM and VLM queries are pre-computed on a remote workstation and the computed path is commanded to the robot wirelessly. The VNM runs on-board and only uses forward RGB images and unfiltered GPS measurements.

5.2 Following Instructions with LM-Nav

In each evaluation environment, we first construct the graph by manually driving the robot and collecting image and GPS observations. The graph is constructed automatically using the VNM from this data, and in principle such data could also be obtained from past traversals, or even with autonomous exploration methods [45]. Once the graph is constructed, the robot can carry out instructions in that environment. We tested our system on 20 queries, in environments of varying difficulty, corresponding to a total combined length of over 6km. Instructions include a set of prominent landmarks in the environment that can be identified from the robot’s observations, e.g. traffic cones, buildings, stop signs, etc.

Fig. 4 shows qualitative examples of the path taken by the robot. Note that the overhead image and spatial localization of the landmarks are not available to the robot and are shown for visualization only. In Fig. 4(a), LM-Nav is able to successfully localize the simple landmarks from its prior traversal and find a short path to the goal. While there are multiple stop signs in the environment, the objective in Eqn. 3 causes the robot to pick the correct stop sign in context, so as to minimize overall travel distance. Fig. 4(b) highlights LM-Nav’s ability to parse complex instructions with multiple landmarks specifying the route: despite the possibility of a shorter route directly to the final landmark that ignores instructions, the robot finds a path that visits all of the landmarks in the correct order.

[Figure text: “Go straight toward the white building. Continue straight passing by a white truck until you reach a stop sign.”]

Disambiguation with instructions. Since the objective of LM-Nav is to follow instructions, and not merely to reach the final goal, different instructions may lead to different traversals. Fig. 5 shows an example where modifying the instruction can disambiguate multiple paths to the goal. Given the shorter prompt (blue), LM-Nav prefers the more direct path. On specifying a more fine-grained route (magenta), LM-Nav takes an alternate path that passes a different set of landmarks.

Missing landmarks. While LM-Nav is effective at parsing landmarks from instructions, localizing them on the graph, and finding a path to the goal, it relies on the assumption that the landmarks (i) exist in the environment, and (ii) can be identified by the VLM. Fig. 4(c) illustrates a case where the executed path fails to
