iccv2019論文全集8410-model-compression-with-adversarial-robustness-a-unified-optimization-framework_第1頁(yè)
iccv2019論文全集8410-model-compression-with-adversarial-robustness-a-unified-optimization-framework_第2頁(yè)
iccv2019論文全集8410-model-compression-with-adversarial-robustness-a-unified-optimization-framework_第3頁(yè)
iccv2019論文全集8410-model-compression-with-adversarial-robustness-a-unified-optimization-framework_第4頁(yè)
iccv2019論文全集8410-model-compression-with-adversarial-robustness-a-unified-optimization-framework_第5頁(yè)
已閱讀5頁(yè),還剩7頁(yè)未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1、Model Compression with Adversarial Robustness: A Unifi ed Optimization Framework Shupeng Gui?, Haotao Wang, Haichuan Yang?, Chen Yu?, Zhangyang Wangand Ji Liu ?Department of Computer Science, University of Rochester Department of Computer Science and Engineering, Texas Ax,y)be the loss function that

2、 the target model aims to minimize, wheredenotes the model parameters and(x,y)the training pairs. The adversarial loss, i.e., the training objective for the attacker, is defi ned by fadv(;x,y) =max x0B (x) f(;x0,y)(1) It could be understood that the maximum (worst) target model loss attainable at an

3、y point within B (x). Next, since the target model needs to defend against the attacker, it requires to suppress the worst risk. Therefore, the overall objective for the target model to gain adversarial robustness could be expressed as Z denotes the training data set: min X (x,y)Z fadv(;x,y).(2) 2.2

4、Integrating Pruning, Factorization and Quantization for the ATMC Constraint As we reviewed previously, typical CNN model compression strategies include pruning (element-level 9, or channel-level 8), low-rank factorization 15,12,16, and quantization 22,18,19. In this work, we aim to integrate all thr

5、ee into a unifi ed, fl exible structural constraint. Without loss of generality, we denote the major operation of a CNN layer (either convolutional or fully-connected) asxout= Wxin,W Rmn,m n; computing the non-linearity (neuron) parts takes minor resources compared to the large-scale matrix-vector m

6、ultiplication. The basic pruning 9 encourages the elements ofWto be zero. On the other hand, the factorization-based methods decomposesW = W1W2. Looking at the two options, we propose to enforce the following structure to W (k is a hyperparameter): W = UV + C,kUk0+ kV k0+ kCk0 k,(3) wherek k0denotes

7、 the number of nonzeros of the augment matrix. The above enforces a novel, compound (including both multiplicative and additive) sparsity structure onW, compared to existing sparsity structures directly on the elements ofW. Decomposing a matrix into sparse factors was studied before 53, but not in a

8、 model compression context. We further allow for a sparse 3 errorC for more fl exibility, as inspired from robust optimization 54. By default, we choose U Rmm,V Rmnin (3). Many extensions are clearly available for equation 3. For example, the channel pruning 8 enforces rows ofWto be zero, which coul

9、d be considered as a specially structured case of basic element-level pruning. It could be achieved by a drop-in replacement of group-sparsity norms. We choose0norm here both for simplicity, and due to our goal here being focused on reducing model size only. We recognize that using group sparsity no

10、rms in (3) might potentially be a preferred option if ATMC will be adapted for model acceleration. Quantization is another powerful strategy for model compression 19,20,55. To maximize the representation capability after quantization, we choose to use the nonuniform quantization strategy to represen

11、t the nonzero elements in DNN parameter, that is, each nonzero element of the DNN parameter can only be chosen from a set of a few values and these values are not necessarily evenly distributed and need to be optimized. We use the notation| |0to denote the number of different values except 0 in the

12、augment matrix, that is, |M|0:= |Mi,j: Mi,j6= 0 i j| For example, forM = 0,1;4;1,kMk0= 3and|M|0= 2. To answer the all nonzero elements ofU(l),V (l),C(l), we introduce the non-uniform quantization strategy (i.e., the quantization intervals or thresholds are not evenly distributed). We also do not pre

13、-choose those thresholds, but instead learn them directly with ATMC, by only constraining the number of unique nonzero values through predefi ning the number of representation bits b in each matrix, such as |U(l)|0 2b, |V (l)|0 2b, |C(l)|0 2bl L. 2.3ATMC: Formulation Let us use to denote the (re-par

14、ameterized) weights in all L layers: := U(l),V (l),C(l)L l=1. We are now ready to present the overall constrained optimization formulation of the proposed ATMC framework, combining all compression strategies or constraints: min X (x,y)Z fadv(;x,y)(4) s.t. L X l=1 kU(l)k0+ kV (l)k0 + kC(l)k0 |z kk0 k

15、, (sparsity constraint) Qb:= : |U(l)|0 2b, |V (l)|0 2b, |C(l)|0 2bl L.(quantization constraint) Both k and b are hyper-parameters in ATMC: k controls the overall sparsity of , and b controls the quantization bit precision per nonzero element. They are both “global” for the entire model rather than l

16、ayer-wise, i.e., setting only the two hyper-parameters will determine the fi nal compression. We note that it is possible to achieve similar compression ratios using different combinations ofkandb (but likely leading to different accuracy/robustness), and the two can indeed collaborate or trade-off

17、with each other to achieve more effective compression. 2.4Optimization The optimization in equation 4 is a constrained optimization with two constraints. The typical method to solve the constrained optimization is using projected gradient descent or projected stochastic gradient descent, if the proj

18、ection operation (onto the feasible set defi ned by constraints) is simple enough. Unfortunately, in our case, this projection is quite complicated, since the intersection of the sparsity constraint and the quantization constraint is complicated. However, we notice that the projection onto the feasi

19、ble set defi ned by each individual constraint is doable (the projection onto the sparsity constraint is quite standard, how to do effi cient projection onto the quantization constraint 4 defi ned set will be clear soon). Therefore, we apply the ADMM 56 optimization framework to split these two cons

20、traints by duplicating the optimization variable. First the original optimization formulation equation 4 can be rewritten as by introducing one more constraint min kk0k, 0Qb X (x,y)Z fadv(;x,y)(5) s.t. = 0. It can be further cast into a minimax problem by removing the equality constraint = 0: min kk

21、0k, 0Qb max u X (x,y)Z fadv(;x,y) + hu, 0i + 2k 0k2 F (6) where 0 is a predefi ned positive number in ADMM. Plug the form offadv, we obtain a complete minimax optimization min kk0k, 0Qb max u, xadvB (x)(x,y)Z X (x,y)Z f(;x0,y) + hu, 0i + 2k 0k2 F (7) ADMM essentially iteratively minimizes variables

22、and 0, and maximizes u and all xadv. Update uWe update the dual variable asut+1= ut+ ( 0), which can be considered to be a gradient ascent step with learning rate 1/. Update xadvWe update xadvfor sampled data (x,y) by xadv Projx0:kx0xkx + xf(;x,y) Update The fi rst step is to optimize in equation 7

23、(fi xing other variables) which is only related to the sparsity constraint. Therefore, we are essentially solving min f(;xadv,y) + 2k 0 + uk2 F s.t. kk0 k. Since the projection onto the sparsity constraint is simply enough, we can use the projected stochastic gradient descent method by iteratively u

24、pdating as Proj00:k00k0k ? h f(;xadv,y) + 2k 0 + uk2 F i? . 00: k00k0 k denotes the feasible domain of the sparsity constraint. tis the learning rate. Update 0The second step is to optimize equation 7 with respect to0 (fi xing other variables), which is essentially solving the following projection p

25、roblem min 0 k0 ( + u)k2 F, s.t. 0 Qb.(8) To take a close look at this formulation, we are essentially solving the following particular one dimensional clustering problem with 2b+ 1 clusters on + u (for each U(l), V (l), and C(l) min U,ak2b k=1 kU Uk2 F s.t.Ui,j 0,a1,a2, ,a2b. The major difference f

26、rom the standard clustering problem is that there is a constant cluster0. Take U0(l)as an example, the update rule of0isU0(l) t = ZeroKmeans2b(U(l)+ uU(l), whereuU(l) is the dual variable with respect toU(l)in . Here we use a modifi ed Lloyds algorithm 57 to solve equation 8. The detail of this algo

27、rithm is shown in Algorithm 1, We fi nally summarize the full ATMC algorithm in Algorithm 2. 3Experiments To demonstrate that ATMC achieves remarkably favorable trade-offs between robustness and model compactness, we carefully design experiments on a variety of popular datasets and models as summa-

28、rized in Section 3.1. Specifi cally, since no algorithm with exactly the same goal (adversarially robust compression) exists off-the-shelf, we craft various ablation baselines, by sequentially composing different compression strategies with the state-of-the-art adversarial training 31. Besides, we s

29、how that the robustness of ATMC compressed models generalizes to different attackers. 5 Table 1: The datasets and CNN models used in the experiments. Models#Parametersbit widthModel Size (bits)Dataset x,y)? 9:end for 10:Proj00:k00k0k? tf(;xadv,y)+ 2k 0+uk2 F ? 11:0 ZeroKmeans2b( + u) 12:u u + ( 0) 1

30、3:end for 3.1Experimental Setup Datasets and Benchmark Models As in Table 1, we select four popular image classifi cation datasets and pick one top-performer CNN model on each: LeNet on the MNIST dataset 58; ResNet34 59 on CIFAR-10 60 and CIFAR-100 60; and WideResNet 61 on SVHN 62. Evaluation Metric

31、s The classifi cation accuracies on both benign and on attacked testing sets are reported, the latter being widely used to quantify adversarial robustness, e.g., in 31. The model size is computed via multiplying the quantization bit per element with the total number of non-zero elements, added with

32、the storage size for the quantization thresholds ( equation 8). The compression ratio is defi ned by the ratio between the compressed and original model sizes. A desired model compression is then expected to achieve strong adversarial robustness (accuracy on the attacked testing set), in addition to

33、 high benign testing accuracy, at compression ratios from low to high . ATMC Hyper-parametersFor ATMC, there are two hyper-parameters in equation 4 to control compression ratios:kin the sparsity constraint, andbin the quantization constraint. In our experi- ments, we try 32-bit (b= 32) full precisio

34、n, and 8-bit (b= 8) quantization; and then vary differentk values under either bit precision, to navigate different compression ratios. We recognize that a better compression-robustness trade-off is possible via fi ne-tuning, or perhaps to jointly search for,kandb. Training SettingsFor adversarial t

35、raining, we apply the PGD 31 attack to fi nd adversarial samples. Unless otherwise specifi ed, we set the perturbation magnitudeto be 76 for MNIST and 4 for the other three datasets. (The color scale of each channel is between 0 and 255.) Following the settings in 31, we set PGD attack iteration num

36、bersnto be 16 for MNIST and 7 for the other three datasets. We follow 29 to set PGD attack step sizeto bemin( + 4,1.25)/n. We train ATMC for 50, 150, 150, 80 epochs on MNIST, CIFAR10, CIFAR100 and SVHN respectively. 6 Adversarial Attack SettingsWithout further notice, we use PGD attack with the same

37、 settings as used in adversarial training on testing sets to evaluate model robustness. In section 3.3, we also evaluate model robustness on PGD attack, FGSM attack 28 and WRM attack 44 with varying attack parameter settings to show the robustness of our method across different attack settings. 3.2C

38、omparison to Pure Compression, Pure Defense, and Their Mixtures Since no existing work directly pursues our same goal, we start from two straightforward baselines to be compared with ATMC: standard compression (without defense), and standard defense (without compression). Furthermore, we could craft

39、 “mixture” baselines to achieve the goal: fi rst applying a defense method on a dense model, then compressing it, and eventually fi ne-tuning the compressed model (with parameter number unchanged, e.g., by fi xing zero elements) using the defense method again. We design the followingsevencomparison

40、methods (the default bit precision is 32 unless otherwise specifi ed): Non-Adversarial Pruning (NAP): we train a dense state-of-the-art CNN and then compress it by the pruning method proposed in 9: only keeping the largest-magnitudes weight elements while setting others to zero, and then fi ne-tune

41、the nonzero weights (with zero weights fi xed) on the training set again until convergence, . NAP can thus explicitly control the compressed model size in the same way as ATMC. There is no defense in NAP. Dense Adversarial Training (DA): we apply adversarial training 31 to defend a dense CNN, with n

42、o compression performed. Adversarial Pruning (AP ): we fi rst apply the defensive method 31 to pre-train a defense CNN. We then prune the dense model into a sparse one 9 , and fi ne-tune the non-zero weights of pruned model until convergence, similarly to NAP. Adversarial0Pruning (A0): we start from

43、 the same pre-trained dense defensive CNN used by AP and then apply0projected gradient descent to solve the constrained optimization problem with an adversarial training objective and a constraint on number of non-zero parameters in the CNN. Note that this is in essence a combination of one state-of

44、-the-art compression method 63 and PGD adversarial training. Adversarial Low-Rank Decomposition (ALR): it is all similar to the AP routine, except that we use low rank factorization 16 in place of pruning to achieve the compression step. ATMC (8 bits, 32 bits): two ATMC models with different quantiz

45、ation bit precisions are chosen. For either one, we will vary k for different compression ratios. 103102101 Model Size Compression Ratio 0.2 0.4 0.6 0.8 1.0 Accuracy Original NAP AP ALR AL0 ATMC-32bits ATMC-8bits DA 103102101 Model Size Compression Ratio 0.5 0.6 0.7 0.8 0.9 Accuracy Original NAP AP

46、ALR AL0 ATMC-32bits ATMC-8bits DA 103102101 Model Size Compression Ratio 0.1 0.2 0.3 0.4 0.5 0.6 0.7 Accuracy Original NAP AP ALR AL0 ATMC-32bits ATMC-8bits DA 103102101 Model Size Compression Ratio 0.4 0.5 0.6 0.7 0.8 0.9 Accuracy Original NAP AP ALR AL0 ATMC-32bits ATMC-8bits DA 10 3 10 2 10 1 Mod

47、el Size Compression Ratio 0.0 0.2 0.4 0.6 0.8 Accuracy PGD Attack NAP AP ALR AL0 ATMC-32bits ATMC-8bits DA (a) MNIST 10 3 10 2 10 1 Model Size Compression Ratio 0.0 0.2 0.4 0.6 Accuracy PGD Attack NAP AP ALR AL0 ATMC-32bits ATMC-8bits DA (b) CIFAR-10 10 3 10 2 10 1 Model Size Compression Ratio 0.0 0

48、.1 0.2 0.3 Accuracy PGD Attack NAP AP ALR AL0 ATMC-32bits ATMC-8bits DA (c) CIFAR-100 10 3 10 2 10 1 Model Size Compression Ratio 0.0 0.2 0.4 0.6 Accuracy PGD Attack NAP AP ALR AL0 ATMC-32bits ATMC-8bits DA (d) SVHN Figure 1: Comparison among NAP, AP, A0, ALR and ATMC (32 bits although the practical

49、ly achievable trade-off differs by method. For example, while NAP (a standard CNN compression) obtains decent accuracy results on benign testing sets (e.g., the best on CIFAR-10 and CIFAR-100), it becomes very deteriorated in terms of robustness under adversarial attacks. That verifi es our motivati

50、ng intuition: naive compression, while still maintaining high standard accuracy, can signifi cantly compromise robustness the “hidden price” has indeed been charged. The observation also raises a red fl ag for current evaluation ways of CNN compression, where the robustness of compressed models is (

51、almost completely) overlooked. Second, while both AP and ALR consider compression and defense in ad-hoc sequential ways, A0 and ATMC-32 bits further gain notably advantages over them via “joint optimization” type methods, in achieving superior trade-offs between benign test accuracy, robustness, and

52、 compression ratios. Furthermore, ATMC-32 bits outperforms A0especially at the low end of compression ratios. That is owing to the the new decomposition structure that we introduced in ATMC. Third, ATMC achieves comparable test accuracy and robustness to DA, with only minimal amounts of parameters a

53、fter compression. Meanwhile, ATMC also achieves very close, sometimes better accuracy-compression ratio trade-offs on benign testing sets than NAP, with much enhanced robust- ness. Therefore, it has indeed combined the best of both worlds. It also comes to our attention that for ATMC-compressed mode

54、ls, the gaps between their accuracies on benign and attacked testing sets are smaller than those of the uncompressed original models. That seems to potentially suggest that compression (when done right) in turn has positive performance regularization effects. Lastly, we compare between ATMC-32bits a

55、nd ATMC-8bits. While ATMC-32bits already outper- forms other baselines in terms of robustness-accuracy trade-off, more aggressive compression can be achieved by ATMC-8bits (with further around four-time compression at the same sparsity level), with still competitive performance. The incorporation of

56、 quantization and weight pruning/decomposition in one framework allows us to fl exibly explore and optimize their different combinations. 0.0020.0040.0060.0080.010 Model Size Compression Ratio 0.0 0.2 0.4 0.6 0.8 Accuracy PGD-eps:2 NAP AP ALR ATMC-32bits ATMC-8bits (a) PGD, perturbation=2 0.0020.004

57、0.0060.0080.010 Model Size Compression Ratio 0.0 0.1 0.2 0.3 0.4 Accuracy PGD-eps:8 NAP AP ALR ATMC-32bits ATMC-8bits (b) PGD, perturbation=8 0.0020.0040.0060.0080.010 Model Size Compression Ratio 0.1 0.2 0.3 0.4 0.5 0.6 Accuracy FGSM-eps:4 NAP AP ALR ATMC-32bits ATMC-8bits (c) FGSM, perturbation=4

58、0.0020.0040.0060.0080.010 Model Size Compression Ratio 0.2 0.3 0.4 0.5 0.6 0.7 0.8 Accuracy WRM-iter:7 NAP AP ALR ATMC-32bits ATMC-8bits (d) WRM, penalty=1.3, it- eration=7 Figure 2: Robustness-model size trade-off under different attacks and perturbation levels. Note that the accuracies here are all measured by attacked images, i.e., indicating robustness. 3.3Generalized Robustness Against Other Attackers In all previous experiments, we test ATMC and other baselines against the PGD attacker at certain fi xed perturbation levels. We will now show that the superio

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

評(píng)論

0/150

提交評(píng)論