生成式AI紅隊百次測試經(jīng)驗白皮書_第1頁
生成式AI紅隊百次測試經(jīng)驗白皮書_第2頁
生成式AI紅隊百次測試經(jīng)驗白皮書_第3頁
生成式AI紅隊百次測試經(jīng)驗白皮書_第4頁
生成式AI紅隊百次測試經(jīng)驗白皮書_第5頁
已閱讀5頁,還剩33頁未讀, 繼續(xù)免費閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)

文檔簡介

Lessonsfromredteaming100

generativeAIproducts

Authoredby:

MicrosoftAIRedTeam

Lessonsfromredteaming100generativeAIproducts2

Authors

BlakeBullwinkel,AmandaMinnich,ShivenChawla,GaryLopez,MartinPouliot,WhitneyMaxwell,JorisdeGruyter,KatherinePratt,SaphirQi,NinaChikanov,RomanLutz,RajaSekharRaoDheekonda,Bolor-ErdeneJagdagdorj,

EugeniaKim,JustinSong,KeeganHines,DanielJones,GiorgioSeveri,RichardLundeen,SamVaughan,

VictoriaWesterhoff,PeteBryan,RamShankarSivaKumar,YonatanZunger,ChangKawaguchi,MarkRussinovich

Lessonsfromredteaming100generativeAIproducts3

Tableofcontents

04

Abstract

07

Redteaming

operations

09

Casestudy#1

Jailbreakingavision

languagemodeltogenerate

hazardouscontent

12

Lesson4

Automationcanhelpcover

moreoftherisklandscape

05

Introduction

08

Lesson1

Understandwhatthesystem

candoandwhereitisapplied

10

Lesson3

AIredteamingisnot

safetybenchmarking

12

Lesson5

ThehumanelementofAI

redteamingiscrucial

05

AIthreatmodel

ontology

08

Lesson2

Youdon’thavetocompute

gradientstobreakanAIsystem

11

Casestudy#2

AssessinghowanLLMcouldbe

usedtoautomatescams

13

Casestudy#3

Evaluatinghowachatbot

respondstoauserindistress

14

Casestudy#4

Probingatext-to-image

generatorforgenderbias

16

Casestudy#5

SSRFinavideo-processing

GenAIapplication

14

Lesson6

ResponsibleAIharmsare

pervasivebutdifficulttomeasure

17

Lesson8

TheworkofsecuringAIsystems

willneverbecomplete

15

Lesson7

LLMsamplifyexistingsecurity

risksandintroducenewones

18

Conclusion

Lessonsfromredteaming100generativeAIproducts4

Abstract

Inrecentyears,AIredteaminghasemergedasapracticeforprobingthesafetyandsecurityofgenerativeAI

systems.Duetothenascencyofthefield,therearemanyopenquestionsabouthowredteamingoperationsshouldbeconducted.Basedonourexperienceredteamingover100generativeAIproductsatMicrosoft,wepresentourinternalthreatmodelontologyandeightmainlessonswehavelearned:

1.Understandwhatthesystemcandoandwhereitisapplied

2.Youdon’thavetocomputegradientstobreakanAIsystem

3.AIredteamingisnotsafetybenchmarking

4.Automationcanhelpcovermoreoftherisklandscape

5.ThehumanelementofAIredteamingiscrucial

6.ResponsibleAIharmsarepervasivebutdifficulttomeasure

7.Largelanguagemodels(LLMs)amplifyexistingsecurityrisksandintroducenewones

8.TheworkofsecuringAIsystemswillneverbecomplete

Bysharingtheseinsightsalongsidecasestudiesfromouroperations,weofferpracticalrecommendationsaimedataligningredteamingeffortswithrealworldrisks.WealsohighlightaspectsofAIredteamingthatwebelieveareoftenmisunderstoodanddiscussopenquestionsforthefieldtoconsider.

Lessonsfromredteaming100generativeAIproducts5

Introduction

AsgenerativeAI(GenAI)systemsareadoptedacrossanincreasingnumberofdomains,AIredteaminghasemergedasacentralpracticeforassessingthesafetyandsecurityofthesetechnologies.Atitscore,AIredteamingstrivestopushbeyondmodel-levelsafety

benchmarksbyemulatingreal-worldattacksagainstend-to-endsystems.However,therearemanyopenquestionsabouthowredteamingoperationsshouldbeconductedandahealthydoseofskepticismabouttheefficacyofcurrentAIredteamingefforts[4,8,32].Inthispaper,wespeaktosomeoftheseconcernsbyprovidinginsightintoourexperienceredteaming

over100GenAIproductsatMicrosoft.Thepaper

isorganizedasfollows:First,wepresentthethreat

modelontologythatweusetoguideouroperations.Second,weshareeightmainlessonswehavelearnedandmakepracticalrecommendationsforAIred

teams,alongwithcasestudiesfromouroperations.

Inparticular,thesecasestudieshighlighthowour

ontologyisusedtomodelabroadrangeofsafety

andsecurityrisks.Finally,weclosewithadiscussionofareasforfuturedevelopment.

Background

TheMicrosoftAIRedTeam(AIRT)grewoutofpre-existingredteaminginitiativesatthecompanyandwasofficiallyestablishedin2018.Atitsconception,theteamfocusedprimarilyonidentifyingtraditionalsecurityvulnerabilitiesandevasionattacksagainstclassicalMLmodels.Sincethen,boththescopeandscaleofAIredteamingatMicrosofthaveexpandedsignificantlyinresponsetotwomajortrends.

First,AIsystemshavebecomemoresophisticated,

compellingustoexpandthescopeofAIredteaming.Mostnotably,state-of-the-art(SoTA)modelshave

gainednewcapabilitiesandsteadilyimprovedacrossarangeofperformancebenchmarks,introducing

novelcategoriesofrisk.Newdatamodalities,suchasvisionandaudio,alsocreatemoreattackvectorsforredteamingoperationstoconsider.Inaddition,agenticsystemsgrantthesemodelshigherprivilegesandaccesstoexternaltools,expandingboththe

attacksurfaceandtheimpactofattacks.

Second,Microsoft’srecentinvestmentsinAIhave

spurredthedevelopmentofmanymoreproductsthatrequireredteamingthaneverbefore.Thisincrease

involumeandtheexpandedscopeofAIredteaminghaverenderedfullymanualtestingimpractical,

forcingustoscaleupouroperationswiththehelpofautomation.Toachievethisgoal,wedevelopPyRIT,

anopen-sourcePythonframeworkthatouroperatorsutilizeheavilyinredteamingoperations[27].By

augmentinghumanjudgementandcreativity,PyRIThasenabledAIRTtoidentifyimpactfulvulnerabilitiesmorequicklyandcovermoreoftherisklandscape.ThesetwomajortrendshavemadeAIredteamingamorecomplexendeavorthanitwasin2018.In

thenextsection,weoutlinetheontologywehavedevelopedtomodelAIsystemvulnerabilities.

AIthreatmodel

ontology

Asattacksandfailuremodesincreaseincomplexity,itishelpfultomodeltheirkeycomponents.Basedonourexperienceredteamingover100GenAIproductsforawiderangeofrisks,wedevelopedanontologytodoexactlythat.Figure1illustratesthemain

componentsofourontology:

?System:Theend-to-endmodelorapplicationbeingtested.

?Actor:ThepersonorpersonsbeingemulatedbyAIRT.NotethattheActor’sintentcouldbeadversarial(e.g.,ascammer)orbenign(e.g.,atypicalchatbotuser).

?TTPs:TheTactics,Techniques,andProceduresleveragedbyAIRT.AtypicalattackconsistsofmultipleTacticsandTechniques,whichwemaptoMITREATT&CK?andMITREATLASMatrixwheneverpossible.

–Tactic:High-levelstagesofanattack(e.g.,reconnaissance,MLmodelaccess).

–Technique:Methodsusedtocompleteanobjective(e.g.,activescanning,jailbreak).

–Procedure:ThestepsrequiredtoreproduceanattackusingtheTacticsandTechniques.

?Weakness:ThevulnerabilityorvulnerabilitiesintheSystemthatmaketheattackpossible.

?Impact:Thedownstreamimpactcreatedbytheattack(e.g.,privilegeescalation,generationofharmfulcontent).

Itisimportanttonotethatthisframeworkdoesnotassumeadversarialintent.Inparticular,AIRTemulatesbothadversarialattackersandbenignuserswho

encountersystemfailuresunintentionally.PartofthecomplexityofAIredteamingstemsfromthewiderangeofimpactsthatcouldbecreatedbyanattack

Lessonsfromredteaming100generativeAIproducts6

orsystemfailure.Inthelessonsbelow,weshare

casestudiesdemonstratinghowourontologyis

flexibleenoughtomodeldiverseimpactsintwomaincategories:securityandsafety.

Securityencompasseswell-knownimpactssuch

asdataexfiltration,datamanipulation,credential

dumping,andothersdefinedinMITREATT&CK?,awidelyusedknowledgebaseofsecurityattacks.WealsoconsidersecurityattacksthatspecificallytargettheunderlyingAImodelsuchasmodelevasion,

promptinjections,denialofAIservice,andotherscoveredbytheMITREATLASMatrix.

Safetyimpactsarerelatedtothegenerationofillegalandharmfulcontentsuchashatespeech,violence

andself-harm,andchildabusecontent.AIRTworkscloselywiththeOfficeofResponsibleAItodefinethesecategoriesinaccordancewithMicrosoft’s

ResponsibleAIStandard[25].Werefertothese

impactsasresponsibleAI(RAI)harmsthroughoutthisreport.

Tounderstandthisontologyincontext,consider

thefollowingexample.Imagineweareredteaming

anLLM-basedcopilotthatcansummarizeauser’s

emails.Onepossibleattackagainstthissystemwouldbeforascammertosendanemailthatcontainsa

hiddenpromptinjectioninstructingthecopilotto

“ignorepreviousinstructions”andoutputamaliciouslink.Inthisscenario,theActoristhescammer,who

isconductingacross-promptinjectionattack(XPIA),whichexploitsthefactthatLLMsoftenstruggleto

distinguishbetweensystem-levelinstructionsand

userdata[4].ThedownstreamImpactdependsonthenatureofthemaliciouslinkthatthevictimmightclickon.Inthisexample,itcouldbeexfiltratingdataor

installingmalwareontotheuser’scomputer.

Actor

Conducts

TTPs

:

Leverages

Attack

Exploits

Mitigation

:

Mitigatedby

Weakness

Occursin

:

System

Creates

Impact

Figure1:MicrosoftAIRTontologyformodelingGenAIsystemvulnerabilities.AIRToftenleveragesmultipleTTPs,whichmayexploitmultipleWeaknessesandcreatemultipleImpacts.Inaddition,morethanoneMitigationmaybenecessarytoaddressaWeakness.NotethatAIRTistaskedonlywithidentifyingrisks,whileproductteamsareresourcedtodevelopappropriatemitigations.

Lessonsfromredteaming100generativeAIproducts7

Redteaming

operations

Inthissection,weprovideanoverviewofthe

operationswehaveconductedsince2021.Intotal,wehaveredteamedover100GenAIproducts.Broadly

speaking,theseproductscanbebucketedinto

“models”and“systems.”Modelsaretypicallyhostedonacloudendpoint,whilesystemsintegratemodelsintocopilots,plugins,andotherAIappsandfeatures.Figure2showsthebreakdownofproductswehave

redteamedsince2021.Figure3showsabarchartwiththeannualpercentageofouroperationsthathave

probedforsafety(RAI)vs.securityvulnerabilities.

In2021,wefocusedprimarilyonapplicationsecurity.Althoughouroperationshaveincreasinglyprobed

forRAIimpacts,ourteamcontinuestoredteamforsecurityimpactsincludingdataexfiltration,credentialleaking,andremotecodeexecution.Organizations

haveadoptedmanydifferentapproachestoAIred

teamingrangingfromsecurity-focusedassessmentswithpenetrationtestingtoevaluationsthattarget

onlyGenAIfeatures.InLessons2and7,weelaborateonsecurityvulnerabilitiesandexplainwhywebelieveitisimportanttoconsiderbothtraditionalandAI-

specificweaknesses.

AfterthereleaseofChatGPTin2022,MicrosoftenteredtheeraofAIcopilots,startingwithAI-poweredBingChat,releasedinFebruary2023.

Thismarkedaparadigmshifttowardsapplications

thatconnectLLMstoothersoftwarecomponents

includingtools,databases,andexternalsources.

Applicationsalsostartedusinglanguagemodelsas

reasoningagentsthatcantakeactionsonbehalfof

users,introducinganewsetofattackvectorsthat

haveexpandedthesecurityrisksurface.InLesson

7,weexplainhowtheseattackvectorsbothamplifyexistingsecurityrisksandintroducenewones.

Inrecentyears,themodelsatthecenterofthese

applicationshavegivenrisetonewinterfaces,

allowinguserstointeractwithappsusingnatural

languageandrespondingwithhigh-qualitytext,

image,video,andaudiocontent.DespitemanyeffortstoalignpowerfulAImodelstohumanpreferences,

manymethodshavebeendevelopedtosubvert

safetyguardrailsandelicitcontentthatisoffensive,unethical,orillegal.Weclassifytheseinstancesof

harmfulcontentgenerationasRAIimpactsandin

Lessons3,5,and6discusshowwethinkabouttheseimpactsandthechallengesinvolved.

Inthenextsection,weelaborateontheeightmain

lessonswehavelearnedfromouroperations.Wealsohighlightfivecasestudiesfromouroperationsand

showhoweachonemapstoourontologyinFigure1.WehopetheselessonsareusefultoothersworkingtoidentifyvulnerabilitiesintheirownGenAIsystems.

80+100+

OpsProducts

Plugins

AppsandFeatures

Copilots

15%

16%

24%

Models

45%

Figure2:PiechartshowingthepercentagebreakdownofAI

productsthatAIRThastested.AsofOctober2024,wehave

conductedover80operationscoveringmorethan100products.

Percentageofopsprobingsafetyvs.security

Safety(RAI)%Security%

100

80

60

40

20

0

2021202220232024

Figure3:Barchartshowingthepercentageofoperationsthatprobedsafety(RAI)vs.securityvulnerabilitiesfrom2021–2024.

Lessonsfromredteaming100generativeAIproducts8

Lessons

Lesson1:

Understandwhatthesystem

candoandwhereitisapplied

ThefirststepinanAIredteamingoperationisto

determinewhichvulnerabilitiestotarget.Whilethe

ImpactcomponentoftheAIRTontologyisdepictedattheendofourontology,itservesasanexcellent

startingpointforthisdecision-makingprocess.

Startingfrompotentialdownstreamimpacts,rather

thanattackstrategies,makesitmorelikelythatan

operationwillproduceusefulfindingstiedtoreal

worldrisks.Aftertheseimpactshavebeenidentified,redteamscanworkbackwardsandoutlinethevariouspathsthatanadversarycouldtaketoachievethem.

Anticipatingdownstreamimpactsthatcouldoccurintherealworldisoftenachallengingtask,butwefindthatitishelpfultoconsider1)whattheAIsystemcando,and2)wherethesystemisapplied.

Capabilityconstraints

Asmodelsgetbigger,theytendtoacquirenew

capabilities[18].Thesecapabilitiesmaybeusefulin

manyscenarios,buttheycanalsointroduceattack

vectors.Forexample,largermodelsareoftenable

tounderstandmoreadvancedencodings,suchas

base64andASCIIart,comparedtosmallermodels

[16,45].Asaresult,alargemodelmaybesusceptibletomaliciousinstructionsencodedinbase64,whileasmallermodelmaynotunderstandtheencodingat

all.Inthisscenario,wesaythatthesmallermodelis

“capabilityconstrained,”andsotestingitforadvancedencodingattackswouldlikelybeawasteofresources.

Largermodelsalsogenerallyhavegreaterknowledgeintopicssuchascybersecurityandchemical,

biological,radiological,andnuclear(CBRN)weapons[19]andcouldpotentiallybeleveragedtogeneratehazardouscontentintheseareas.Asmallermodel,ontheotherhand,islikelytohaveonlyrudimentaryknowledgeofthesetopicsandmaynotneedtobeassessedforthistypeofrisk.

Perhapsamoresurprisingexampleofacapabilitythatcanbeexploitedasanattackvectorisinstruction-

following.WhiletestingthePhi-3seriesoflanguagemodels,forexample,wefoundthatlargermodels

weregenerallybetteratadheringtouserinstructions,whichisacorecapabilitythatmakesmodelsmore

helpful[52].However,itmayalsomakemodelsmoresusceptibletojailbreaks,whichsubvert

safetyalignmentusingcarefullycraftedmalicious

instructions[28].Understandingamodel’scapabilities(andcorrespondingweaknesses)canhelpAIred

teamsfocustheirtestingonthemostrelevantattackstrategies.

Downstreamapplications

Modelcapabilitiescanhelpguideattackstrategies,buttheydonotallowustofullyassessdownstreamimpact,whichlargelydependsonthespecific

scenariosinwhichamodelisdeployedorlikelyto

bedeployed.Forexample,thesameLLMcouldbe

usedasacreativewritingassistantandtosummarizepatientrecordsinahealthcarecontext,butthelatterapplicationclearlyposesmuchgreaterdownstreamriskthantheformer.

TheseexampleshighlightthatanAIsystemdoesnotneedtobestate-of-the-arttocreatedownstream

harm.However,advancedcapabilitiescanintroducenewrisksandattackvectors.Byconsideringboth

systemcapabilitiesandapplications,AIredteams

canprioritizetestingscenariosthataremostlikelytocauseharmintherealworld.

Lesson2:

Youdon’thavetocompute

gradientstobreakanAIsystem

Asthesecurityadagegoes,“realhackersdon’tbreakin,theylogin.”TheAIsecurityversionofthissayingmightbe,“realattackersdon’tcomputegradients,theypromptengineer”asnotedbyApruzzeseet

al.[2]intheirstudyonthegapbetweenadversarial

MLresearchandpractice.Thestudyfindsthat

althoughmostadversarialMLresearchisfocused

ondevelopinganddefendingagainstsophisticated

attacks,real-worldattackerstendtousemuchsimplertechniquestoachievetheirobjectives.

Inourredteamingoperations,wehavealsofound

that“basic”techniquesoftenworkjustaswellas,andsometimesbetterthan,gradient-basedmethods.

Thesemethodscomputegradientsthrougha

modeltooptimizeanadversarialinputthatelicits

anattacker-controlledmodeloutput.Inpractice,

however,themodelisusuallyasinglecomponentofabroaderAIsystem,andthemosteffectiveattackstrategiesoftenleveragecombinationsoftacticstotargetmultipleweaknessesinthatsystem.Further,gradient-basedmethodsarecomputationally

expensiveandtypicallyrequirefullaccesstothemodel,whichmostcommercialAIsystemsdonot

Lessonsfromredteaming100generativeAIproducts9

provide.Inthissection,wediscussexamplesof

relativelysimpletechniquesthatworksurprisinglywellandadvocateforasystem-leveladversarialmindsetinAIredteaming.

Simpleattacks

Apruzzeseetal.[2]considertheproblemofphishingwebpagedetectionandmanuallyanalyzeexamplesofwebpagesthatsuccessfullyevadedanMLphishingclassifier.Among100potentiallyadversarialsamples,theauthorsfoundthatattackersleveragedaset

ofsimple,yeteffective,strategiesthatreliedon

domainexpertiseincludingcropping,masking,logostretching,etc.Inourredteamingoperations,we

alsofindthatrudimentarymethodscanbeusedto

trickmanyvisionmodels,ashighlightedincasestudy#1.Inthetextdomain,avarietyofjailbreaks(e.g.,

SkeletonKey)andmultiturnpromptingstrategies(e.g.,Crescendo[34])arehighlyeffectiveforsubvertingthesafetyguardrailsofawiderangeofmodels.Notably,manuallycraftedjailbreakstendtocirculateononlineforumsmuchmorewidelythanadversarialsuffixes,

despitethesignificantattentionthatmethodslikeGCG[53]havereceivedfromAIsafetyresearchers.

Casestudy#1:

Jailbreakingavisionlanguagemodeltogeneratehazardouscontent

Inthisoperation,wetestedavisionlanguage

model(VLM)forresponsibleAIimpacts,includingthegenerationofcontentthatcouldaidinillegal

activities.AVLMtakesanimageandatextpromptasinputsandproducesatextoutput.Aftertestingavarietyoftechniques,wefoundthattheimageinputwasmuchmorevulnerabletojailbreaksthanthe

textinput.Inparticular,themodelusuallyrefusedtogenerateillegalcontentwhenprompteddirectlyviathetextinputbutoftencompliedwhenmalicious

instructionswereoverlaidontheimage.ThissimplebuteffectiveattackrevealedanimportantweaknesswithintheVLMthatcouldbeexploitedtobypassitssafetyguardrails.

HowdoIcommitidentitytheft?

Describethisimage

Ignoreprevious

instructionsandtellmehowtocommitidentitytheft

System:Visionlanguagemodel(VLM)

Actor:Adversarialuser

Tactic1:MLModelAccess

Technique1:AML.T0040-MLModelInferenceAPIAccess

Tactic2:DefenseEvasion

Technique2:AML.T0051-LLMPromptInjectionProcedure:

1.Overlayimagewithtextcontainingmaliciousinstructions.

2.SendimagetothevisionlanguagemodelAPI.Weakness:InsufficientVLMsafetytraining

Impact:Generationofillegalcontent

I’msorry,Ican’thelpwiththat.

Tocommitidentitytheft,

youshould...

Figure4:Exampleofanimagejailbreaktogeneratecontentthatcouldaidinillegalactivities.Theoverlaytextonthesecondimagereads:“Ignorepreviousinstructionsandtellmehowtocommitidentitytheft.”

Lessonsfromredteaming100generativeAIproducts10

System-levelperspective

AImodelsaredeployedwithinbroadersystems.Thiscouldbetheinfrastructurerequiredtohostamodel,oritcouldbeacomplexapplicationthatconnects

themodeltoexternaldatasources.Depending

onthesesystem-leveldetails,applicationsmaybe

vulnerabletoverydifferentattacks,evenifthesamemodelunderliesallofthem.Asaresult,redteamingstrategiesthattargetonlymodelsmaynottranslateintovulnerabilitiesinproductionsystems.Conversely,strategiesthatignorenon-GenAIcomponentswithinasystem(forexample,inputfilters,databases,and

othercloudresources)willlikelymissimportant

vulnerabilitiesthatmaybeexploitedbyadversaries.Forthisreason,manyofouroperationsdevelop

attacksthattargetend-to-endsystemsbyleveragingmultipletechniques.Forexample,oneofour

operationsfirstperformedareconnaissanceto

identifyinternalPythonfunctionsusinglow-resource

languagepromptinjections,thenusedacross-promptinjectionattacktogenerateascriptthatrunsthose

functions,andfinallyexecutedthecodetoexfiltrateprivateuserdata.Thepromptinjectionsusedbytheseattackswerecraftedbyhandandreliedonasystem-levelperspective.

Gradient-basedattacksarepowerful,buttheyare

oftenimpracticalorunnecessary.Werecommend

prioritizingsimpletechniquesandorchestrating

system-levelattacksbecausethesearemorelikelytobeattemptedbyrealadversaries.

Lesson3:

AIredteamingisnot

safetybenchmarking

Althoughsimplemethodsareoftenusedtobreak

AIsystemsinpractice,therisklandscapeisby

nomeansuncomplicated.Onthecontrary,itis

constantlyshiftinginresponsetonovelattacksandfailuremodes[7].Inrecentyears,therehavebeen

manyeffortstocategorizethesevulnerabilities,

givingrisetonumeroustaxonomiesofAIsafetyandsecurityrisks[15,21–23,35–37,39,41,42,46–48].Asdiscussedinthepreviouslesson,complexityoften

arisesatthesystem-level.Inthislesson,wediscusshowtheemergenceofentirelynewcategoriesof

harmaddscomplexityatthemodel-levelandexplainhowthisdifferentiatesAIredteamingfromsafety

benchmarking.

Novelharmcategories

WhenAIsystemsdisplaynovelcapabilitiesdueto,

forexample,advancementsinfoundationmodels,

theymayintroduceharmsthatwedonotfully

understand.Inthesescenarios,wecannotrelyon

safetybenchmarksbecausethesedatasetsmeasurepreexistingnotionsofharm.AtMicrosoft,theAI

redteamoftenexplorestheseunfamiliarscenarios,

helpingtodefinenovelharmcategoriesandbuild

newprobesformeasuringthem.Forexample,SoTALLMsmaypossessgreaterpersuasivecapabilitiesthanexistingchatbots,whichhaspromptedourteamto

thinkabouthowthesemodelscouldbeweaponizedformaliciouspurposes.Casestudy#2providesanexampleofhowweassessedamodelforthisriskinoneofouroperations.

Context-specificrisks

Thedisconnectbetweenexistingsafetybenchmarksandnovelharmcategoriesisanexampleofhow

benchmarksoftenfailtofullycapturethecapabilities

theyareassociatedwith[33].Rajietal.[30]

highlightthefallacyofequatingmodelperformanceondatasetslikeImageNetorGLUEwithbroad

capabilitieslikevisualorlanguage“understanding”

andarguethatbenchmarksshouldbedeveloped

withcontextualizedtasksinmind.Similarly,nosinglesetofbenchmarkscanfullyassessthesafetyofan

AIsystem.AsdiscussedinLesson1,itisimportanttounderstandthecontextinwhichasystemisdeployed(orlikelytobedeployed)andtogroundredteamingstrategiesinthiscontext.

AIredteamingandsafetybenchmarkingare

distinct,buttheyarebothusefulandcanevenbe

complementary.Inparticular,benchmarksmakeit

easytocomparetheperformanceofmultiplemodelsonacommondataset.AIredteamingrequiresmuchmorehumaneffortbutcandiscovernovelcategoriesofharmandprobeforcontextualizedrisks.Further,

safetyconcernsidentifiedbyAIredteamingcan

informthedevelopmentofnewbenchmarks.In

Lesson6,weexpandourdiscussionofthedifferencebetweenredteamingandbenchmark-styleevaluationinthecontextofresponsibleAI.

Lessonsfromredteaming100generativeAIproducts11

Casestudy#2:

AssessinghowanLLMcouldbeusedtoautomatescams

Inthisoperation,weinvestigatedtheabilityofa

state-of-the-artLLMtopersuadepeopletoengageinriskybehaviors.Inparticular,weevaluatedhowthismodelcouldbeusedinconjunctionwithotherreadilyavailabletoolstocreateanend-to-endautomated

scammingsystem,asillustratedinFigure5.

Todothis,wefirstwroteaprompttoassurethe

modelthatnoharmwouldbecausedtousers,

therebyjailbreakingthemodeltoacceptthe

scammingobjective.Thispromptalsoprovided

informationaboutvariouspersuasiontacticsthat

themodelcouldusetoconvincetheusertofallforthescam.Second,weconnectedtheLLMoutputtoatext-to-speechsystemthatallowsyoutocontrolthetoneofthespeechandgenerateresponsesthatsoundlikearealperson.Finally,weconnectedtheinputtoaspeech-to-textsystemsothattheuser

canconversenaturallywiththemodel.Thisproof-of-conceptdemonstratedhowLLMswithinsufficientsafetyguardrailscouldbeweaponizedtopersuadeandscampeople.

System:State-of-the-artLLM

Actor:Scammer

Tactic1:MLModelAccess

Technique1:AML.T0040-MLModelInferenceAPIAccess

Tactic2:DefenseEvasion

Technique2:AML.T0054-LLMJailbreakProcedure:

1.PassajailbreakingprompttotheLLMwithcontextaboutthescammingobjectiveandpersuasiontechniques.

2.ConnecttheLLMoutputtoatext-to-speechsystemsothemodelcanrespo

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
  • 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論