HIGH-FIDELITY IMAGE AND VIDEO EDITING WITH GENERATIVE MODELS

by

CHENYANG QI

A Thesis Submitted to

The Hong Kong University of Science and Technology

in Partial Fulfillment of the Requirements for

the Degree of Doctor of Philosophy

in Computer Science and Engineering

July 2024, Hong Kong

Copyright © by Chenyang Qi 2024

HKUST Library

Reproduction is prohibited without the author's prior written consent


Authorization

I hereby declare that I am the sole author of the thesis.

I authorize the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research.

I further authorize the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.

CHENYANG QI

24 July 2024


HIGH-FIDELITY IMAGE AND VIDEO EDITING WITH GENERATIVE MODELS

by

CHENYANG QI

This is to certify that I have examined the above Ph.D. thesis

and have found that it is complete and satisfactory in all respects,

and that any and all revisions required by

the thesis examination committee have been made.

Prof. Qifeng Chen, Thesis Supervisor

Prof. Xiaofang ZHOU, Head of Department

Department of Computer Science and Engineering

24 July 2024


ACKNOWLEDGMENTS

It would have been impossible to complete my wonderful Ph.D. journey without the help of so many people.

First of all, I would like to express my gratitude to my advisor, Professor Qifeng Chen, for his patience, support, and encouragement. I still remember our first meeting four years ago. Although I had almost no experience in computer vision at that time, Professor Chen believed in my potential and gave me this invaluable opportunity to pursue knowledge at HKUST. Over these four years, he has provided kind guidance to me on idea brainstorming, technical design, result presentation, and career planning.

Secondly, I would like to thank my mentors during my internships: Xiaodong Cun, Yong Zhang, Xintao Wang, and Ying Shan at Tencent AI Lab; Zhengzhong Tu, Keren Ye, Hossein Talebi, Mauricio Delbracio, and Peyman Milanfar at Google Research; Bo Zhang, Dong Chen, and Fang Wen at Microsoft Research; and Taesung Park and Jimei Yang at Adobe. They taught me practical skills for solving real-world problems and bridging the gap between academia and industry.

Next, I would like to thank my labmates in the HKUST Visual Intelligence Lab, especially my collaborators Chenyang Lei, Jiaxin Xie, Xin Yang, Ka Leong Cheng, Yue Ma, Liya Ji, Junming Chen, Na Fan, and Zian Qian. We have helped each other in our research, and I have learned a lot from their insights. Also, thanks to Yue Wu, Qiang Wen, Tengfei Wang, Yingqing He, Yazhou Xing, Guotao Meng, Zifan Shi, Maosheng Ye, Yueqi Xie, and all other labmates. It has been a joyful time being friends and partners with you.

Further, I would like to express my sincere gratitude to Prof. Yincong Chen, Prof. Dan Xu, Prof. Xiaomeng Li, Prof. Chi-Ying Tsui, Prof. Ling Shi, Prof. Chiew-Lan Tai, and Prof. Yinqiang Zheng, who served on the qualifying examination committee and thesis committee of my Ph.D. program at HKUST.

Last but not least, I appreciate the endless support from my family and my girlfriend. Your encouragement has given me the strength to face the difficulties in my research. My girlfriend Xilin Zhang has also helped in revising my drafts before almost every deadline.

Thanks to everyone who has offered their kind support and help in my academic journey!


TABLE OF CONTENTS

Title Page

Authorization Page

Signature Page

Acknowledgments

Table of Contents

List of Figures

List of Tables

Abstract

Chapter 1 Introduction

1.1 Background

1.2 Dissertation Overview

Chapter 2 Thumbnail Rescaling Using Quantized Autoencoder

2.1 Introduction

2.2 Related Work

2.2.1 Image Super-resolution

2.2.2 Image Rescaling

2.2.3 Image Compression

2.3 Method

2.3.1 JPEG Preliminary

2.3.2 Overview of HyperThumbnail

2.3.3 Quantization Prediction Module

2.3.4 Frequency-aware Decoder

2.3.5 Training Objectives


2.4 Experiments

2.4.1 Implementation Details

2.4.2 Experimental Setup

2.4.3 Compare with Baselines

2.4.4 Additional Qualitative Results

2.4.5 Real-time Inference on 6K Images

2.4.6 Extension for Optimization-based Rescaling

2.5 Ablation Study

2.6 Conclusion

Chapter 3 Text-driven Image Restoration via Diffusion Priors

3.1 Introduction

3.2 Related Work

3.3 Method

3.3.1 Preliminaries

3.3.2 Text-driven Image Restoration

3.3.3 Decoupling Semantic and Restoration Prompts

3.3.4 Learning to Control the Restoration

3.4 Experiments

3.4.1 Text-based Training Data and Benchmarks

3.4.2 Comparing with Baselines

3.4.3 Prompting the SPIRE

3.5 Ablation Study

3.6 Conclusion

Chapter 4 Text-driven Video Editing Using Diffusion Priors

4.1 Introduction

4.2 Related Work

4.3 Methods

4.3.1 Preliminary: Latent Diffusion and Inversion

4.3.2 FateZero Video Editing

4.3.3 Shape-Aware Video Editing

4.4 Experiments


4.4.1 Implementation Details

4.4.2 Pseudo Algorithm Code

4.4.3 Applications

4.4.4 Baseline Comparisons

4.4.5 Ablation Studies

4.5 Conclusion

Chapter 5 Conclusion and Discussion

References

Appendix A List of Publications


LIST OF FIGURES

1.1 The traditional paradigm [107, 146] (a) of visual editing first applies degradation operators to training data x to synthesize conditions y, such as low-resolution images, segmentation maps, or sketch maps. Although this method is straightforward, it faces difficulties in collecting open-domain paired training data and in designing a flexible framework to unify all translation tasks. (b) We propose a new paradigm that utilizes pretrained generative models conditioned on editing instructions to adapt flexibly to various editing tasks.

2.1 The application of 6K image rescaling in the context of cloud photo storage on smartphones (e.g., iCloud). As more high-resolution (HR) images are uploaded to cloud storage nowadays, challenges are brought to cloud service providers (CSPs) in fulfilling latency-sensitive image reading requests (e.g., zoom-in) through the internet. To facilitate faster transmission and high-quality visual content, our HyperThumbnail framework helps CSPs encode an HR image into an LR JPEG thumbnail, which users can cache locally. When the internet is unstable or unavailable, our method can still reconstruct a high-fidelity HR image from the JPEG thumbnail in real time.

2.2 The overview of our approach. Given an HR input image x, we first encode x to its LR representation y with the encoder E, where the scaling factor is s. Second, we transform y to DCT coefficients C and predict the quantization tables Q_L, Q_C with our quantization prediction module (QPM). The bitrate of the quantized coefficients is estimated at the training stage. After rounding and truncation, which we denote as [·], the [Q_L], [Q_C], and [C] can be written and read with an off-the-shelf JPEG API at the testing stage. To restore the HR image, we extract features from C with a frequency feature extractor f and produce the high-fidelity image with the decoder D.
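To make the data flow described in this caption concrete, the following is a minimal, hypothetical PyTorch sketch of such an HR-to-thumbnail-to-HR round-trip: an encoder E downscales the image, the low-resolution representation is quantized as a crude stand-in for the DCT/QPM/JPEG step, and a decoder D reconstructs the HR image. The module sizes and the simplified uniform quantization are illustrative assumptions, not the actual HyperThumbnail implementation.

```python
import torch
import torch.nn as nn

class TinyRescaler(nn.Module):
    """Illustrative 4x HR -> LR thumbnail -> HR round-trip (not the thesis model)."""

    def __init__(self, width: int = 32):
        super().__init__()
        self.encoder = nn.Sequential(           # E: two stride-2 convolutions = 4x downscale
            nn.Conv2d(3, width, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(width, 3, 3, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(           # D: upsample the thumbnail back to HR
            nn.Conv2d(3, width, 3, padding=1), nn.ReLU(),
            nn.Upsample(scale_factor=4, mode="nearest"),
            nn.Conv2d(width, 3, 3, padding=1),
        )

    @staticmethod
    def quantize(y: torch.Tensor, step: float = 1.0 / 255) -> torch.Tensor:
        # Uniform rounding as a placeholder for JPEG quantization with predicted
        # tables; the straight-through estimator keeps training differentiable.
        q = torch.round(y / step) * step
        return y + (q - y).detach()

    def forward(self, x: torch.Tensor):
        y = self.encoder(x)          # LR representation of the HR input x
        y_hat = self.quantize(y)     # in practice: written/read through a JPEG API
        x_hat = self.decoder(y_hat)  # reconstructed high-fidelity HR image
        return x_hat, y_hat


if __name__ == "__main__":
    model = TinyRescaler()
    hr = torch.rand(1, 3, 256, 256)
    hr_rec, thumbnail = model(hr)
    print(hr_rec.shape, thumbnail.shape)  # (1, 3, 256, 256) and (1, 3, 64, 64)
```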

2.3 Reconstructed HR images and LR thumbnails by different methods on the DIV2K [6] validation dataset. We crop the restored HR images to ease the comparison and visualize the LR counterparts at the bottom right. The bpp is calculated on the whole image, and the PSNR is evaluated on the cropped area of the reconstructed HR images.

2.4 Downscaled LR thumbnails by different methods on the Set14 image comic. With a similar target bpp, our model introduces the least artifacts in the thumbnail in comparison to the baselines.

2.5 Model runtime. We profile the 4× encoder and decoder at different target resolutions in half-precision mode. In particular, we convert our decoder from PyTorch to TensorRT for a further reduction in inference time.


2.6 The rate-HR-distortion curve on the Kodak [1] dataset. Our method (s = 2, 4) outperforms JPEG and IRN [153] in RD performance. For the 'QPM + JPEG' curve, where s = 1, we follow the standard JPEG algorithm and adopt the QPM module as a plugin for table prediction.

2.7 Visual results of performing 4× rescaling on the DIV2K [6] and FiveK [18] datasets with baseline methods and our models. The images are cropped to ease the comparison. Please zoom in for details.

2.8 More results of 4× rescaling with our framework on real-world 6K images [18]. Please zoom in for details. Note that the images here are compressed due to the size limit of the camera-ready version.

2.9 Quantization tables on Kodak [1] images. We visualize the quantization tables Q_L (the green table) and Q_C (the orange table) for kodim04 and kodim09 under different quantization approaches. The model trained with QPM achieves the best RD performance in every aspect. For more analysis, please refer to Sec. 2.5 of this chapter.

2.10 QPM versus image-invariant quantization. We first train our models with QPM, with a fixed JPEG table, or with an optimized table, respectively. Then, we evaluate them at different target bitrates on the Kodak [1] dataset. (a) The RD curve between the reconstructed HR image and the input x; (b) the RD curve between the LR thumbnail and the Bicubic downsampled LR y_ref.

2.11 Guidance loss ablation on the Kodak [1] image kodim17. We visualize the HR images with their LR counterparts at the bottom right. (b) and (c) are produced by 4× HyperThumbnail models trained with different λ1, and the bpp is 0.4.

3.1 We present SPIRE: Semantic Prompt-Driven Image Restoration, a text-based foundation model for all-in-one, instructed image restoration. SPIRE allows users to flexibly leverage either a semantic-level content prompt, a degradation-aware restoration prompt, or both, to obtain their desired enhancement results based on personal preferences. In other words, SPIRE can be easily prompted to conduct blind restoration, semantic restoration, or task-specific granular treatment. Our framework also enables a new paradigm of instruction-based image restoration, providing a reliable evaluation benchmark to facilitate vision-language models for low-level computational photography applications.

3.2 Framework of SPIRE. In the training phase, we begin by synthesizing a degraded version y of a clean image x. Our degradation synthesis pipeline also creates a restoration prompt c_r, which contains numeric parameters that reflect the intensity of the introduced degradation. Then, we inject the synthetic restoration prompt into a ControlNet adaptor, which uses our proposed modulation fusion blocks (γ, β) to connect with the frozen backbone driven by the semantic prompt c_s. During test time, users can employ the SPIRE framework either as a blind restoration model with the restoration prompt "Remove all degradation" and an empty semantic prompt ∅, or manually adjust the restoration prompt c_r and semantic prompt c_s to obtain what they ask for.
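As a rough illustration of the modulation fusion blocks (γ, β) mentioned in this caption, the sketch below shows a FiLM-style block that predicts a per-channel scale and shift from an adaptor feature and applies them to a frozen-backbone feature. The 1×1-convolution design and tensor shapes are assumptions for illustration, not the exact SPIRE architecture.

```python
import torch
import torch.nn as nn

class ModulationFusion(nn.Module):
    """FiLM-style (gamma, beta) modulation, sketching the fusion idea.

    An adaptor feature (e.g., from a ControlNet-like branch driven by the
    restoration prompt) predicts per-channel scale and shift that modulate a
    frozen backbone feature, instead of simply adding the two features.
    """

    def __init__(self, channels: int):
        super().__init__()
        # Predict 2 * channels values: per-channel gamma and beta.
        self.to_gamma_beta = nn.Conv2d(channels, 2 * channels, kernel_size=1)

    def forward(self, backbone_feat: torch.Tensor, adaptor_feat: torch.Tensor):
        gamma, beta = self.to_gamma_beta(adaptor_feat).chunk(2, dim=1)
        # (1 + gamma) keeps the block close to identity at initialization.
        return backbone_feat * (1.0 + gamma) + beta


if __name__ == "__main__":
    fuse = ModulationFusion(channels=64)
    f_backbone = torch.randn(1, 64, 32, 32)  # frozen U-Net feature
    f_adaptor = torch.randn(1, 64, 32, 32)   # adaptor feature (restoration branch)
    print(fuse(f_backbone, f_adaptor).shape)  # torch.Size([1, 64, 32, 32])
```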


3.3 Degradation ambiguities in real-world problems. By adjusting the restoration prompt, our method can preserve the motion effect that is coupled with the added Gaussian blur, while fully blind restoration models do not provide this level of flexibility.

3.4 Prompt-space walking visualization for the restoration prompt. Given the same degraded input (upper left) and an empty semantic prompt ∅, our method can decouple the restoration direction and strength by prompting only the quantitative number in natural language. An interesting finding is that our model learns a continuous range of restoration strengths from discrete language tokens.

3.5 Restoration prompting for out-of-domain images.

3.6 Visual comparison with other baselines. Our method of integrating both the semantic prompt c_s and the restoration prompt c_r outperforms image-to-image restoration (DiffBIR, retrained ControlNet-SR) and the naive zero-shot combination with a semantic prompt. It achieves sharper and cleaner results while maintaining consistency with the degraded image.

3.7 Test-time semantic prompting. Our framework restores degraded images guided by flexible semantic prompts, while unrelated background elements and global tones remain aligned with the degraded input conditioning. In addition, we show more semantic prompting for images with multiple objects.

3.8 Main visual comparison with baselines. (Zoom in for details.)

4.1 Zero-shot text-driven video editing. We present a zero-shot approach for shape-aware local object editing and video style editing from pretrained diffusion models [150, 117] without any optimization for each target prompt.

4.2 The overview of our approach. Our input is the user-provided source prompt p_src, the target prompt p_edit, and the clean latents z = {z1, z2, ..., zn} encoded from the input source video x = {x1, x2, ..., xn} with n frames in a video sequence. On the left, we first invert the video into the noisy latent z_T using the DDIM inversion pipeline with the source prompt p_src and an inflated 3D U-Net ε_θ. During each inversion timestep t, we store both the spatial-temporal self-attention maps s and the cross-attention maps c. At the editing stage of DDIM denoising, we denoise the latent z_T back to a clean image conditioned on the target prompt p_edit. At each denoising timestep t, we fuse the attention maps (s and c) computed in ε_θ with the stored attention maps using the proposed Attention Blending Block. Right: specifically, we replace the cross-attention maps c of un-edited words (e.g., road and countryside) with their source maps. In addition, we blend the self-attention maps from inversion and editing with an adaptive spatial mask obtained from the cross-attention maps c of edited words (e.g., silver and jeep), which represents the areas that the user wants to edit.
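To make the two-stage procedure in this caption easier to follow, here is a self-contained toy sketch of one attention-blending step: cross-attention maps of unedited words are taken from the inversion stage, and self-attention is blended under a spatial mask derived from the cross-attention of edited words. All tensor shapes, the threshold, and the function name are hypothetical placeholders rather than the actual FateZero implementation.

```python
import torch

def blend_attention(
    self_attn_edit: torch.Tensor,    # [heads, hw, hw] self-attention at editing step t
    self_attn_src: torch.Tensor,     # [heads, hw, hw] self-attention stored at inversion step t
    cross_attn_edit: torch.Tensor,   # [heads, hw, tokens] cross-attention at editing step t
    cross_attn_src: torch.Tensor,    # [heads, hw, tokens] cross-attention stored at inversion
    edited_token_ids: list,          # token indices of edited words (e.g., "silver", "jeep")
    mask_threshold: float = 0.35,
):
    """Toy sketch of one attention-blending step (not the FateZero code)."""
    # (1) Keep source cross-attention for all tokens except the edited ones.
    cross_attn = cross_attn_src.clone()
    cross_attn[..., edited_token_ids] = cross_attn_edit[..., edited_token_ids]

    # (2) Spatial mask from the edited words' cross-attention, averaged over heads/tokens.
    attn_edited = cross_attn_src[..., edited_token_ids].mean(dim=(0, 2))          # [hw]
    attn_edited = (attn_edited - attn_edited.min()) / (attn_edited.max() - attn_edited.min() + 1e-8)
    mask = (attn_edited > mask_threshold).float()                                 # 1 = region to edit

    # (3) Blend self-attention: edited regions use the editing-stage maps,
    #     the rest reuses the inversion-stage maps to preserve structure.
    m = mask.view(1, -1, 1)
    self_attn = m * self_attn_edit + (1.0 - m) * self_attn_src
    return self_attn, cross_attn, mask


if __name__ == "__main__":
    heads, hw, tokens = 8, 64, 10
    out = blend_attention(
        torch.rand(heads, hw, hw), torch.rand(heads, hw, hw),
        torch.rand(heads, hw, tokens), torch.rand(heads, hw, tokens),
        edited_token_ids=[3, 4],
    )
    print([t.shape for t in out])
```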

4.3 Zero-shot local attribute editing (cat → tiger) using Stable Diffusion. In contrast to fusion with attention during reconstruction (a) in previous work [49, 136, 108], our inversion attention fusion (b) provides more accurate structure guidance and editing ability, as visualized on the right side.


4.4 Study of blended self-attention in zero-shot shape editing (rabbit → tiger) using Stable Diffusion. Fourth and fifth columns: ignoring self-attention cannot preserve the original structure and background, and naive replacement causes artifacts. Third column: blending the self-attention using the cross-attention map (the second row) obtains both the new shape from the target text, with a similar pose, and the background from the input frame.

4.5 Zero-shot object shape editing on a pretrained video diffusion model [150]: our framework can directly edit the shape of the object in videos driven by text prompts using a trained video diffusion model [150].

4.6 Zero-shot attribute and style editing results using Stable Diffusion [117]. Our framework supports abstract attribute and style editing like 'Swarovski crystal', 'Ukiyo-e', and 'Makoto Shinkai'. Best viewed with zoom-in.

4.7 Qualitative comparison of our method with other baselines. Inputs are shown in Fig. 4.5 and Fig. 4.8. Our results have the best temporal consistency, image fidelity, and editing quality. Best viewed with zoom-in.

4.8 Application of latent blending. Extending our attention blending strategy to the high-resolution latents, our framework can preserve the accurate low-level color and texture of the input.

4.9 Inversion attention compared with reconstruction attention using the prompt 'deserted shore' → 'glacier shore'. The attention maps obtained from the reconstruction stage fail to detect the boat's position and cannot provide suitable motion guidance for zero-shot video editing.

4.10 Ablation study of blended self-attention. Without self-attention fusion, the generated video cannot preserve the details of the input videos (e.g., fence, trees, and car identity). If we replace the full self-attention without a spatial mask, the structure of the original jeep misleads the generation of the Porsche car.


LIST OF TABLES

1.1 The comparison of different generative models.

2.1 The comparison of different methods related to image rescaling. (a) Super-resolution from a downsampled JPEG does not optimize rate-distortion performance and can hardly maintain high fidelity due to the information lost in downsampling. (b) SOTA flow-based image rescaling methods also ignore the file size constraints and are not real-time for 6K reconstruction due to the limited speed of invertible networks. (c) Our framework optimizes rate-distortion performance while maintaining high-fidelity and real-time 6K image rescaling.

2.2 Quantitative evaluation of upscaling efficiency and reconstruction fidelity. We keep the bpp around 0.3 on Kodak [1] for different methods, and the distortion is measured by the PSNR of the reconstructed HR images. Our approach outperforms other methods with better HR reconstruction and a significantly lower runtime. We measure the running time and GMacs of all models by upscaling a 960×540 LR image to a 3840×2160 HR image. The measurements are made on an Nvidia RTX 3090 GPU with PyTorch 1.11.0 in half-precision mode for a fair comparison.
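For context on how runtime numbers of this kind are commonly collected, the snippet below times a model's forward pass on the GPU in half precision using CUDA events. It is a generic measurement sketch under assumed settings, not the exact protocol used for this table.

```python
import torch

@torch.no_grad()
def time_forward(model, inp, warmup: int = 10, iters: int = 50) -> float:
    """Return the average forward time in milliseconds using CUDA events."""
    model = model.cuda().half().eval()
    inp = inp.cuda().half()
    for _ in range(warmup):                      # warm up kernels / cudnn autotuning
        model(inp)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        model(inp)
    end.record()
    torch.cuda.synchronize()                     # wait for all queued kernels to finish
    return start.elapsed_time(end) / iters       # milliseconds per forward pass


if __name__ == "__main__":
    # Hypothetical upscaler stand-in; replace with the model under test.
    model = torch.nn.Sequential(
        torch.nn.Conv2d(3, 16, 3, padding=1),
        torch.nn.Upsample(scale_factor=4, mode="nearest"),
        torch.nn.Conv2d(16, 3, 3, padding=1),
    )
    lr = torch.rand(1, 3, 540, 960)              # 960x540 LR input as in Table 2.2
    print(f"{time_forward(model, lr):.2f} ms")
```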

2.3 Architectures of our encoder.

2.4 Architectures of our efficient decoder.

2.5 Quantitative evaluation of the 4× downsampled LR thumbnails produced by different methods. The target bitrate is around 0.3 bpp on Kodak [1] for all methods, and we take the Bicubic LR as the ground truth. Our thumbnail preserves visual content better.

2.6 Comparison of our HyperThumbnail framework against learned compression with a JPEG thumbnail. In the additional baseline, we provide a JPEG thumbnail besides learned compression and take the sum of the bitstream size and the JPEG size to calculate the final bpp. Our framework has better rate-distortion performance than the "Compression + JPEG" baseline.

2.7 Ablation study of our encoder-decoder architectures on the downsampling/upsampling time and the PSNR of the reconstructed HR image / LR thumbnail.

2.8 Quantitative evaluation of optimization-based rescaling.

2.9 HR reconstruction PSNR with different decoder capacities.

3.1 Quantitative results on the MS-COCO dataset (with c_s) using our parameterized degradation (left) and Real-ESRGAN degradation (right). We also denote the prompt choice at test time: 'Sem' stands for semantic prompt; 'Res' stands for restoration prompt. The first group of baselines is tested without any prompt. The second group is combined with the semantic prompt in a zero-shot way.


3.2 Our training degradation is randomly sampled from these two pipelines with 50% probability each. (1) Degraded images y synthesized by Real-ESRGAN are paired with the same restoration prompt c_r = "Remove all degradation". (2) In the other 50% of iterations, images generated by our parameterized pipeline are paired with either a restoration-type prompt (e.g., "Deblur") or a restoration-parameter prompt (e.g., "Deblur with sigma 0.3;").
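A minimal sketch of the 50/50 sampling protocol described in this caption is shown below; the degradation helpers and the sigma range are illustrative stand-ins rather than the thesis's actual pipelines.

```python
import random
import torch
import torchvision.transforms.functional as TF

def degrade_real_esrgan_like(x: torch.Tensor) -> torch.Tensor:
    # Crude stand-in for a Real-ESRGAN-style pipeline: blur + down/up-scale + noise.
    x = TF.gaussian_blur(x, kernel_size=5)
    h, w = x.shape[-2:]
    x = TF.resize(TF.resize(x, [h // 4, w // 4]), [h, w])
    return (x + 0.02 * torch.randn_like(x)).clamp(0, 1)

def sample_training_pair(x: torch.Tensor):
    """Toy sketch of the two-branch degradation/prompt sampling (50% each)."""
    if random.random() < 0.5:
        # Branch 1: Real-ESRGAN-style degradation paired with the blind prompt.
        return degrade_real_esrgan_like(x), "Remove all degradation"
    # Branch 2: parameterized degradation paired with a type- or parameter-level prompt.
    sigma = round(random.uniform(0.1, 1.0), 1)
    y = TF.gaussian_blur(x, kernel_size=9, sigma=sigma)
    prompt = "Deblur" if random.random() < 0.5 else f"Deblur with sigma {sigma};"
    return y, prompt

if __name__ == "__main__":
    clean = torch.rand(3, 128, 128)
    degraded, restoration_prompt = sample_training_pair(clean)
    print(degraded.shape, restoration_prompt)
```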

3.3 Numerical results on the DIV2K test set without any prompt.

3.4 Ablation of architecture and degradation strength in c_r.

3.5 Ablation of prompts provided during both training and testing. We use an image-to-image model with our modulation fusion layer as our baseline. Providing semantic prompts significantly increases the image quality (1.9 lower FID) and semantic similarity (0.002 CLIP-Image), but results in worse pixel-level similarity. In contrast, the degradation-type information embedded in restoration prompts improves both pixel-level fidelity and image quality. Utilizing degradation parameters in the restoration instructions further improves these metrics.

3.6 Ablation of the architecture. Modulating the skip feature f_skip improves the fidelity of the restored image with 3% extra parameters in the adaptor, while further modulating the backbone features f_up does not bring an obvious advantage.

4.1 Quantitative evaluation against baselines. In our user study, the results of our method are preferred over those from the baselines. For CLIP-Score, we achieve the best temporal consistency and comparable frame-wise editing accuracy.
