HIGH-FIDELITY IMAGE AND VIDEO EDITING WITH GENERATIVE MODELS
by
CHENYANG QI
A Thesis Submitted to
The Hong Kong University of Science and Technology
in Partial Fulfillment of the Requirements for
the Degree of Doctor of Philosophy
in Computer Science and Engineering
July 2024, Hong Kong
Copyright © by Chenyang Qi 2024
HKUST Library
Reproduction is prohibited without the author's prior written consent
Authorization
I hereby declare that I am the sole author of the thesis.
I authorize the Hong Kong University of Science and Technology to lend this thesis to other institutions or individuals for the purpose of scholarly research.
I further authorize the Hong Kong University of Science and Technology to reproduce the thesis by photocopying or by other means, in total or in part, at the request of other institutions or individuals for the purpose of scholarly research.
CHENYANG QI
24 July 2024
HIGH-FIDELITY IMAGE AND VIDEO EDITING WITH GENERATIVE MODELS
by
CHENYANG QI
This is to certify that I have examined the above Ph.D. thesis
and have found that it is complete and satisfactory in all respects,
and that any and all revisions required by
the thesis examination committee have been made.
Prof. Qifeng Chen, Thesis Supervisor
Prof. Xiaofang Zhou, Head of Department
Department of Computer Science and Engineering
24 July 2024
ACKNOWLEDGMENTS
It would have been impossible to complete my wonderful Ph.D. journey without the help of so many people.
First of all, I would like to express my gratitude to my advisor, Professor Qifeng Chen, for his patience, support, and encouragement. I still remember our first meeting four years ago. Although I had almost no experience in computer vision at that time, Professor Chen believed in my potential and gave me this invaluable opportunity to pursue knowledge at HKUST. In four years, he has provided kind guidance to me: idea brainstorming, technical design, result presentation, and career planning.
Secondly, I would like to thank my mentors during my internships: Xiaodong Cun, Yong Zhang, Xintao Wang, and Ying Shan at Tencent AI Lab; Zhengzhong Tu, Keren Ye, Hossein Talebi, Mauricio Delbracio, and Peyman Milanfar at Google Research; Bo Zhang, Dong Chen, and Fang Wen at Microsoft Research; and Taesung Park and Jimei Yang at Adobe. They taught me practical skills to solve real-world problems and bridge the gap between academia and industry.
Next, I would like to thank my labmates in the HKUST Visual Intelligence Lab, especially my collaborators Chenyang Lei, Jiaxin Xie, Xin Yang, Ka Leong Cheng, Yue Ma, Liya Ji, Junming Chen, Na Fan, and Zian Qian. We have helped each other in our research, and I have learned a lot from their insights. Also, thanks to Yue Wu, Qiang Wen, Tengfei Wang, Yingqing He, Yazhou Xing, Guotao Meng, Zifan Shi, Maosheng Ye, Yueqi Xie, and all other labmates. It has been a joyful time being friends and partners with you.
Further, I would like to express my sincere gratitude to Prof. Yingcong Chen, Prof. Dan Xu, Prof. Xiaomeng Li, Prof. Chi-Ying Tsui, Prof. Ling Shi, Prof. Chiew-Lan Tai, and Prof. Yinqiang Zheng, who served on the qualifying examination committee and thesis committee of my Ph.D. program at HKUST.
Last but not least, I appreciate the endless support from my family and my girlfriend. Your encouragement has given me the power to face the difficulties in my research. My girlfriend Xilin Zhang has also helped in revising my drafts before almost every deadline.
Thanks to everyone who has offered their kind support and help in my academic journey!
TABLE OF CONTENTS
Title Page
Authorization Page
Signature Page
Acknowledgments
Table of Contents
List of Figures
List of Tables
Abstract
Chapter 1 Introduction
1.1 Background
1.2 Dissertation Overview
Chapter 2 Thumbnail Rescaling Using a Quantized Autoencoder
2.1 Introduction
2.2 Related Work
2.2.1 Image Super-resolution
2.2.2 Image Rescaling
2.2.3 Image Compression
2.3 Method
2.3.1 JPEG Preliminary
2.3.2 Overview of HyperThumbnail
2.3.3 Quantization Prediction Module
2.3.4 Frequency-aware Decoder
2.3.5 Training Objectives
2.4 Experiments
2.4.1 Implementation Details
2.4.2 Experimental Setup
2.4.3 Comparison with Baselines
2.4.4 Additional Qualitative Results
2.4.5 Real-time Inference on 6K Images
2.4.6 Extension for Optimization-based Rescaling
2.5 Ablation Study
2.6 Conclusion
Chapter 3 Text-driven Image Restoration via Diffusion Priors
3.1 Introduction
3.2 Related Work
3.3 Method
3.3.1 Preliminaries
3.3.2 Text-driven Image Restoration
3.3.3 Decoupling Semantic and Restoration Prompts
3.3.4 Learning to Control the Restoration
3.4 Experiments
3.4.1 Text-based Training Data and Benchmarks
3.4.2 Comparison with Baselines
3.4.3 Prompting the SPIRE
3.5 Ablation Study
3.6 Conclusion
Chapter 4 Text-driven Video Editing Using Diffusion Priors
4.1 Introduction
4.2 Related Work
4.3 Methods
4.3.1 Preliminary: Latent Diffusion and Inversion
4.3.2 FateZero Video Editing
4.3.3 Shape-Aware Video Editing
4.4 Experiments
4.4.1 Implementation Details
4.4.2 Pseudocode of the Algorithm
4.4.3 Applications
4.4.4 Baseline Comparisons
4.4.5 Ablation Studies
4.5 Conclusion
Chapter 5 Conclusion and Discussion
References
Appendix A List of Publications
LIST OF FIGURES
1.1 The traditional paradigm [107,146] (a) of visual editing first conducts degradation operators on training data x to synthesize conditions y, such as low-resolution images, segmentation maps, or sketch maps. Although this method is straightforward, it faces difficulties in collecting open-domain paired training data and designing a flexible framework to unify all translation tasks. (b) We propose a new paradigm utilizing pretrained generative models, conditioned on editing instructions, to adapt to various editing tasks flexibly.
2.1 The application of 6K image rescaling in the context of cloud photo storage on smartphones (e.g., iCloud). As more high-resolution (HR) images are uploaded to cloud storage nowadays, challenges are brought to cloud service providers (CSPs) in fulfilling latency-sensitive image reading requests (e.g., zoom-in) through the internet. To facilitate faster transmission and high-quality visual content, our HyperThumbnail framework helps CSPs to encode an HR image into an LR JPEG thumbnail, which users could cache locally. When the internet is unstable or unavailable, our method can still reconstruct a high-fidelity HR image from the JPEG thumbnail in real time.
2.2 The overview of our approach. Given an HR input image x, we first encode x to its LR representation y with the encoder E, where the scaling factor is s. Second, we transform y to DCT coefficients C and predict the quantization tables QL, QC with our quantization prediction module (QPM) to estimate the bitrate of the quantized coefficients C at the training stage. After rounding and truncation, which we denote as [·], the [QL], [QC], and [C] can be written and read with an off-the-shelf JPEG API at the testing stage. To restore the HR image, we extract features from C with a frequency feature extractor f and produce the high-fidelity image with the decoder D.
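As a concrete, hedged illustration of this data flow, the following PyTorch sketch wires stand-in modules for E, QPM, f, and D around a straight-through rounding step; the layer choices, shapes, and the omission of the actual DCT transform are simplifying assumptions for illustration, not the thesis architecture.

import torch
import torch.nn as nn

s = 4                                                  # rescaling factor

E = nn.Conv2d(3, 3, kernel_size=s, stride=s)           # encoder E: HR x -> LR y
qpm = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                    nn.Linear(3, 2 * 64))              # QPM: predicts 8x8 QL and QC
f = nn.Conv2d(3, 64, 3, padding=1)                     # frequency feature extractor f
D = nn.Sequential(nn.Conv2d(64, 3 * s * s, 3, padding=1),
                  nn.PixelShuffle(s))                  # decoder D: LR -> HR

x = torch.randn(1, 3, 256, 256)                        # HR input
y = E(x)                                               # LR thumbnail (1, 3, 64, 64)

# Predict per-image quantization tables QL, QC (unused downstream in this
# sketch; the real pipeline scales the DCT coefficients C by them first).
QL_QC = qpm(y).view(1, 2, 8, 8).clamp(1, 255)

# Rounding [.] via a straight-through estimator keeps the rate term
# differentiable at training time.
y_q = y + (torch.round(y) - y).detach()

x_hat = D(f(y_q))                                      # high-fidelity HR reconstruction
print(x_hat.shape)                                     # torch.Size([1, 3, 256, 256])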
2.3 Reconstructed HR images and LR thumbnails by different methods on the DIV2K [6] validation dataset. We crop the restored HR images to ease the comparison and visualize the LR counterparts at the bottom-right. The bpp is calculated on the whole image, and the PSNR is evaluated on the cropped area of the reconstructed HR images.
2.4 Downscaled LR thumbnails by different methods on the Set14 image comic. With a similar target bpp, our model introduces the fewest artifacts in the thumbnail in comparison to the baselines.
2.5 Model runtime. We profile the 4× encoder and decoder at different target resolutions in half-precision mode. In particular, we convert our decoder from PyTorch to TensorRT for a further inference-time reduction.
2.6 The rate-HR-distortion curve on the Kodak [1] dataset. Our method (s = 2, 4) outperforms JPEG and IRN [153] in RD performance. For the 'QPM + JPEG' curve, where s = 1, we follow the standard JPEG algorithm and adopt the QPM module as a plugin for table prediction.
2.7 Visual results of performing 4× rescaling on the DIV2K [6] and FiveK [18] datasets with baseline methods and our models. The images are cropped to ease the comparison. Please zoom in for details.
2.8 More results of 4× rescaling with our framework on real-world 6K images [18]. Please zoom in for details. Note that the images here are compressed due to the camera-ready size limit.
2.9 Quantization tables on Kodak [1] images. We visualize the quantization tables QL (the green table) and QC (the orange table) for kodim04 and kodim09 of different quantization approaches. The model trained with QPM achieves the best RD performance in every aspect. For more analysis, please refer to Sec. 2.5 of this chapter.
2.10 QPM versus image-invariant quantization. We first train our models with QPM, with a fixed JPEG table, or with an optimized table, respectively. Then, we evaluate them at different target bitrates on the Kodak [1] dataset. (a) The RD curve between the reconstructed HR image and the input x; (b) the RD curve between the LR thumbnail and the bicubic-downsampled LR y_ref.
2.11 Guidance loss ablation on the Kodak [1] image kodim17. We visualize the HR images with their LR counterparts at the bottom-right. (b) and (c) are produced by 4× HyperThumbnail models trained with different λ1, and the bpp is 0.4.
3.1 We present SPIRE: Semantic Prompt-Driven Image Restoration, a text-based foundation model for all-in-one, instructed image restoration. SPIRE allows users to flexibly leverage either a semantic-level content prompt, a degradation-aware restoration prompt, or both, to obtain their desired enhancement results based on personal preferences. In other words, SPIRE can be easily prompted to conduct blind restoration, semantic restoration, or task-specific granular treatment. Our framework also enables a new paradigm of instruction-based image restoration, providing a reliable evaluation benchmark to facilitate vision-language models for low-level computational photography applications.
3.2 Framework of SPIRE. In the training phase, we begin by synthesizing a degraded version y of a clean image x. Our degradation synthesis pipeline also creates a restoration prompt c_r, which contains numeric parameters that reflect the intensity of the introduced degradation. Then, we inject the synthetic restoration prompt into a ControlNet adaptor, which uses our proposed modulation fusion blocks (γ, β) to connect with the frozen backbone driven by the semantic prompt c_s. During test time, users can employ the SPIRE framework either as a blind restoration model, with the restoration prompt "Remove all degradation" and an empty semantic prompt ∅, or manually adjust the restoration prompt c_r and semantic prompt c_s to obtain what they ask for.
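To illustrate the modulation fusion block (γ, β) described above, here is a minimal PyTorch sketch; the class name, the 1×1-convolution parameterization, and the zero initialization are assumptions chosen so the adaptor starts as an identity, not the exact SPIRE code.

import torch
import torch.nn as nn

class ModulationFusion(nn.Module):
    # Predicts (gamma, beta) from adaptor features and modulates a
    # frozen-backbone feature map: f * (1 + gamma) + beta.
    def __init__(self, channels):
        super().__init__()
        self.to_gamma = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_beta = nn.Conv2d(channels, channels, kernel_size=1)
        # Zero init so the adaptor starts as an identity mapping, a common
        # choice for ControlNet-style training.
        for m in (self.to_gamma, self.to_beta):
            nn.init.zeros_(m.weight)
            nn.init.zeros_(m.bias)

    def forward(self, f_backbone, f_adaptor):
        gamma, beta = self.to_gamma(f_adaptor), self.to_beta(f_adaptor)
        return f_backbone * (1 + gamma) + beta

fusion = ModulationFusion(320)
f_skip = torch.randn(1, 320, 32, 32)   # skip feature of the frozen U-Net
f_ctrl = torch.randn(1, 320, 32, 32)   # adaptor feature from the degraded input
print(fusion(f_skip, f_ctrl).shape)    # torch.Size([1, 320, 32, 32])

Modulating the skip feature in this way is cheap relative to the frozen backbone, which is consistent with the adaptor-parameter ablation reported in Table 3.6.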
3.3 Degradation ambiguities in real-world problems. By adjusting the restoration prompt, our method can preserve the motion effect that is coupled with the added Gaussian blur, while fully blind restoration models do not provide this level of flexibility.
3.4 Prompt-space walking visualization for the restoration prompt. Given the same degraded input (upper left) and an empty semantic prompt ∅, our method can decouple the restoration direction and strength via only prompting the quantitative number in natural language. An interesting finding is that our model learns a continuous range of restoration strengths from discrete language tokens.
3.5 Restoration prompting for out-of-domain images.
3.6 Visual comparison with other baselines. Our method of integrating both the semantic prompt c_s and the restoration prompt c_r outperforms image-to-image restoration (DiffBIR, retrained ControlNet-SR) and the naive zero-shot combination with a semantic prompt. It achieves sharper and cleaner results while maintaining consistency with the degraded image.
3.7 Test-time semantic prompting. Our framework restores degraded images guided by flexible semantic prompts, while unrelated background elements and global tones remain aligned with the degraded input conditioning. In addition, we show more semantic prompting for images with multiple objects.
3.8 Main visual comparison with baselines. (Zoom in for details.)
4.1 Zero-shot text-driven video editing. We present a zero-shot approach for shape-aware local object editing and video style editing from pre-trained diffusion models [150,117] without any optimization for each target prompt.
4.2 The overview of our approach. Our input is the user-provided source prompt p_src, the target prompt p_edit, and the clean latents z = {z^1, z^2, ..., z^n} encoded from the input source video x = {x^1, x^2, ..., x^n} with n frames in a video sequence. On the left, we first invert the video into the noisy latent z_T through the DDIM inversion pipeline, using the source prompt p_src and an inflated 3D U-Net ε_θ. During each inversion timestep t, we store both the spatial-temporal self-attention maps s_t^src and the cross-attention maps c_t^src. At the editing stage of the DDIM denoising, we denoise the latent z_T back to the clean latent z_0 conditioned on the target prompt p_edit. At each denoising timestep t, we fuse the attention maps (s_t^edit and c_t^edit) in ε_θ with the stored attention maps (s_t^src, c_t^src) using the proposed Attention Blending Block. Right: specifically, we replace the cross-attention maps c_t^edit of unedited words (e.g., road and countryside) with their source maps c_t^src. In addition, we blend the self-attention maps from inversion (s_t^src) and editing (s_t^edit) with an adaptive spatial mask obtained from the cross-attention maps of the edited words (e.g., silver and jeep), which represents the areas that the user wants to edit.
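The attention-blending rule in this caption can be made concrete with a short sketch. The function below is a minimal illustration, assuming cross-attention maps of shape (heads, pixels, tokens), self-attention maps of shape (heads, pixels, pixels), and a threshold tau for binarizing the spatial mask; these shapes, names, and the thresholding heuristic are assumptions, not the exact FateZero implementation.

import torch

def blend_attention(self_edit, cross_edit, self_src, cross_src,
                    edited_tokens, unedited_tokens, tau=0.3):
    # Replace cross-attention of unedited words (e.g., "road") with the
    # stored source maps so their layout is preserved.
    cross_out = cross_edit.clone()
    cross_out[..., unedited_tokens] = cross_src[..., unedited_tokens]

    # Build a spatial mask from the source cross-attention of the edited
    # words (e.g., "silver", "jeep"): high attention marks the edit region.
    m = cross_src[..., edited_tokens].mean(dim=(0, 2))    # (pixels,)
    mask = (m > tau * m.max()).float().view(1, -1, 1)     # (1, pixels, 1)

    # Inside the mask use the editing self-attention; outside, keep source.
    self_out = mask * self_edit + (1 - mask) * self_src
    return self_out, cross_out

# Toy usage with random maps: 8 heads, a 16x16 latent, 77 text tokens.
h, p, t = 8, 16 * 16, 77
s_out, c_out = blend_attention(torch.rand(h, p, p), torch.rand(h, p, t),
                               torch.rand(h, p, p), torch.rand(h, p, t),
                               edited_tokens=[5, 6], unedited_tokens=[2, 3])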
4.3 Zero-shot local attribute editing (cat → tiger) using Stable Diffusion. In contrast to fusion with attention during reconstruction (a) in previous work [49,136,108], our inversion attention fusion (b) provides more accurate structure guidance and editing ability, as visualized on the right side.
4.4 Study of blended self-attention in zero-shot shape editing (rabbit → tiger) using Stable Diffusion. Fourth and fifth columns: ignoring self-attention cannot preserve the original structure and background, and naive replacement causes artifacts. Third column: blending the self-attention using the cross-attention map (the second row) obtains both the new shape from the target text with a similar pose and the background from the input frame.
4.5 Zero-shot object shape editing on a pre-trained video diffusion model [150]: our framework can directly edit the shape of the object in videos driven by text prompts using a trained video diffusion model [150].
4.6 Zero-shot attribute and style editing results using Stable Diffusion [117]. Our framework supports abstract attribute and style editing like 'Swarovski crystal', 'Ukiyo-e', and 'Makoto Shinkai'. Best viewed with zoom-in.
4.7 Qualitative comparison of our methods with other baselines. Inputs are in Fig. 4.5 and Fig. 4.8. Our results have the best temporal consistency, image fidelity, and editing quality. Best viewed with zoom-in.
4.8 Application of latents blending. Extending our attention blending strategy to high-resolution latents, our framework can preserve the accurate low-level color and texture of the input.
4.9 Inversion attention compared with reconstruction attention using the prompt 'deserted shore' → 'glacier shore'. The attention maps obtained from the reconstruction stage fail to detect the boat's position and cannot provide suitable motion guidance for zero-shot video editing.
4.10 Ablation study of blended self-attention. Without self-attention fusion, the generated video cannot preserve the details of input videos (e.g., fence, trees, and car identity). If we replace the full self-attention without a spatial mask, the structure of the original jeep misleads the generation of the Porsche car.
LIST OF TABLES
1.1 The comparison of different generative models.
2.1 The comparison of different methods related to image rescaling. (a) Super-resolution from a downsampled JPEG does not optimize rate-distortion performance and can hardly maintain high fidelity due to the information lost in downsampling. (b) SOTA flow-based image rescaling methods also ignore the file-size constraints and are not real-time for 6K reconstruction due to the limited speed of invertible networks. (c) Our framework optimizes rate-distortion performance while maintaining high-fidelity and real-time 6K image rescaling.
2.2 Quantitative evaluation of upscaling efficiency and reconstruction fidelity. We keep the bpp around 0.3 on Kodak [1] for different methods, and the distortion is measured by the PSNR of the reconstructed HR images. Our approach outperforms other methods with better HR reconstruction and a significantly lower runtime. We measure the running time and GMacs of all models by upscaling a 960×540 LR image to a 3840×2160 HR image. The measurements are made on an Nvidia RTX 3090 GPU with PyTorch-1.11.0 in half-precision mode for a fair comparison.
2.3 Architectures of our encoder.
2.4 Architectures of our efficient decoder.
2.5 Quantitative evaluation of the 4× downsampled LR thumbnails by different methods. The target bitrate is around 0.3 bpp on Kodak [1] for all methods, and we take the bicubic LR as the ground truth. Our thumbnail preserves visual contents better.
2.6 Comparison of our HyperThumbnail framework against learned compression with a JPEG thumbnail. In the additional baseline, we provide a JPEG thumbnail besides learned compression and take the sum of the bitstream size and JPEG size to calculate the final bpp. Our framework has better rate-distortion performance than the "Compression + JPEG" baseline.
2.7 Ablation study of our encoder-decoder architectures on the downsampling/upsampling time and the PSNR of the reconstructed HR image/LR thumbnail.
2.8 Quantitative evaluation for optimization-based rescaling.
2.9 HR reconstruction PSNR with different decoder capacities.
3.1 Quantitative results on the MS-COCO dataset (with c_s) using our parameterized degradation (left) and Real-ESRGAN degradation (right). We also denote the prompt choice at test time. 'Sem' stands for semantic prompt; 'Res' stands for restoration prompt. The first group of baselines is tested without any prompt. The second group is combined with the semantic prompt in a zero-shot way.
3.2 Our training degradations are randomly sampled from these two pipelines with 50% probability each. (1) Degraded images y synthesized by Real-ESRGAN are paired with the same restoration prompt c_r = "Remove all degradation". (2) In the other 50% of iterations, images generated by our parameterized pipeline are paired with either a restoration-type prompt (e.g., "Deblur") or a restoration-parameter prompt (e.g., "Deblur with sigma 0.3;").
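A minimal sketch of this 50/50 sampling logic is given below; real_esrgan_degrade and gaussian_blur are hypothetical stand-ins for the two degradation pipelines, and the sigma range is an assumption.

import random
import torch

def real_esrgan_degrade(x):
    # Hypothetical stand-in for the full Real-ESRGAN degradation pipeline.
    return x + 0.05 * torch.randn_like(x)

def gaussian_blur(x, sigma):
    # Hypothetical stand-in for a parameterized Gaussian blur of strength sigma.
    return x

def sample_training_pair(x):
    if random.random() < 0.5:
        # (1) Real-ESRGAN branch: always paired with the blind prompt.
        return real_esrgan_degrade(x), "Remove all degradation"
    # (2) Parameterized branch: a type-only or a parameterized prompt.
    sigma = round(random.uniform(0.1, 1.0), 1)   # assumed range
    y = gaussian_blur(x, sigma)
    c_r = "Deblur" if random.random() < 0.5 else f"Deblur with sigma {sigma};"
    return y, c_r

y, c_r = sample_training_pair(torch.rand(1, 3, 64, 64))
print(c_r)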
3.3 Numerical results on the DIV2K test set without any prompt.
3.4 Ablation of the architecture and degradation strength in c_r.
3.5 Ablation of prompts provided during both training and testing. We use an image-to-image model with our modulation fusion layer as our baseline. Providing semantic prompts significantly increases the image quality (1.9 lower FID) and semantic similarity (0.002 CLIP-Image), but results in worse pixel-level similarity. In contrast, the degradation-type information embedded in restoration prompts improves both pixel-level fidelity and image quality. Utilizing degradation parameters in the restoration instructions further improves these metrics.
3.6 Ablation of the architecture. Modulating the skip feature f_skip improves the fidelity of the restored image with 3% extra parameters in the adaptor, while further modulating the backbone features f_up does not bring an obvious advantage.
4.1 Quantitative evaluation against baselines. In our user study, the results of our method are preferred over those from the baselines. For CLIP-Score, we achieve the best temporal consistency and comparable frame-wise editing accuracy.