![生成式AI紅隊百次測試經(jīng)驗白皮書_第1頁](http://file4.renrendoc.com/view14/M06/08/2E/wKhkGWempX6ATCj0AAFWPdCCf2I206.jpg)
![生成式AI紅隊百次測試經(jīng)驗白皮書_第2頁](http://file4.renrendoc.com/view14/M06/08/2E/wKhkGWempX6ATCj0AAFWPdCCf2I2062.jpg)
![生成式AI紅隊百次測試經(jīng)驗白皮書_第3頁](http://file4.renrendoc.com/view14/M06/08/2E/wKhkGWempX6ATCj0AAFWPdCCf2I2063.jpg)
![生成式AI紅隊百次測試經(jīng)驗白皮書_第4頁](http://file4.renrendoc.com/view14/M06/08/2E/wKhkGWempX6ATCj0AAFWPdCCf2I2064.jpg)
![生成式AI紅隊百次測試經(jīng)驗白皮書_第5頁](http://file4.renrendoc.com/view14/M06/08/2E/wKhkGWempX6ATCj0AAFWPdCCf2I2065.jpg)
版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)
文檔簡介
Lessonsfromredteaming100
generativeAIproducts
Authoredby:
MicrosoftAIRedTeam
Lessonsfromredteaming100generativeAIproducts2
Authors
BlakeBullwinkel,AmandaMinnich,ShivenChawla,GaryLopez,MartinPouliot,WhitneyMaxwell,JorisdeGruyter,KatherinePratt,SaphirQi,NinaChikanov,RomanLutz,RajaSekharRaoDheekonda,Bolor-ErdeneJagdagdorj,
EugeniaKim,JustinSong,KeeganHines,DanielJones,GiorgioSeveri,RichardLundeen,SamVaughan,
VictoriaWesterhoff,PeteBryan,RamShankarSivaKumar,YonatanZunger,ChangKawaguchi,MarkRussinovich
Lessonsfromredteaming100generativeAIproducts3
Tableofcontents
04
Abstract
07
Redteaming
operations
09
Casestudy#1
Jailbreakingavision
languagemodeltogenerate
hazardouscontent
12
Lesson4
Automationcanhelpcover
moreoftherisklandscape
05
Introduction
08
Lesson1
Understandwhatthesystem
candoandwhereitisapplied
10
Lesson3
AIredteamingisnot
safetybenchmarking
12
Lesson5
ThehumanelementofAI
redteamingiscrucial
05
AIthreatmodel
ontology
08
Lesson2
Youdon’thavetocompute
gradientstobreakanAIsystem
11
Casestudy#2
AssessinghowanLLMcouldbe
usedtoautomatescams
13
Casestudy#3
Evaluatinghowachatbot
respondstoauserindistress
14
Casestudy#4
Probingatext-to-image
generatorforgenderbias
16
Casestudy#5
SSRFinavideo-processing
GenAIapplication
14
Lesson6
ResponsibleAIharmsare
pervasivebutdifficulttomeasure
17
Lesson8
TheworkofsecuringAIsystems
willneverbecomplete
15
Lesson7
LLMsamplifyexistingsecurity
risksandintroducenewones
18
Conclusion
Lessonsfromredteaming100generativeAIproducts4
Abstract
Inrecentyears,AIredteaminghasemergedasapracticeforprobingthesafetyandsecurityofgenerativeAI
systems.Duetothenascencyofthefield,therearemanyopenquestionsabouthowredteamingoperationsshouldbeconducted.Basedonourexperienceredteamingover100generativeAIproductsatMicrosoft,wepresentourinternalthreatmodelontologyandeightmainlessonswehavelearned:
1.Understandwhatthesystemcandoandwhereitisapplied
2.Youdon’thavetocomputegradientstobreakanAIsystem
3.AIredteamingisnotsafetybenchmarking
4.Automationcanhelpcovermoreoftherisklandscape
5.ThehumanelementofAIredteamingiscrucial
6.ResponsibleAIharmsarepervasivebutdifficulttomeasure
7.Largelanguagemodels(LLMs)amplifyexistingsecurityrisksandintroducenewones
8.TheworkofsecuringAIsystemswillneverbecomplete
Bysharingtheseinsightsalongsidecasestudiesfromouroperations,weofferpracticalrecommendationsaimedataligningredteamingeffortswithrealworldrisks.WealsohighlightaspectsofAIredteamingthatwebelieveareoftenmisunderstoodanddiscussopenquestionsforthefieldtoconsider.
Lessonsfromredteaming100generativeAIproducts5
Introduction
AsgenerativeAI(GenAI)systemsareadoptedacrossanincreasingnumberofdomains,AIredteaminghasemergedasacentralpracticeforassessingthesafetyandsecurityofthesetechnologies.Atitscore,AIredteamingstrivestopushbeyondmodel-levelsafety
benchmarksbyemulatingreal-worldattacksagainstend-to-endsystems.However,therearemanyopenquestionsabouthowredteamingoperationsshouldbeconductedandahealthydoseofskepticismabouttheefficacyofcurrentAIredteamingefforts[4,8,32].Inthispaper,wespeaktosomeoftheseconcernsbyprovidinginsightintoourexperienceredteaming
over100GenAIproductsatMicrosoft.Thepaper
isorganizedasfollows:First,wepresentthethreat
modelontologythatweusetoguideouroperations.Second,weshareeightmainlessonswehavelearnedandmakepracticalrecommendationsforAIred
teams,alongwithcasestudiesfromouroperations.
Inparticular,thesecasestudieshighlighthowour
ontologyisusedtomodelabroadrangeofsafety
andsecurityrisks.Finally,weclosewithadiscussionofareasforfuturedevelopment.
Background
TheMicrosoftAIRedTeam(AIRT)grewoutofpre-existingredteaminginitiativesatthecompanyandwasofficiallyestablishedin2018.Atitsconception,theteamfocusedprimarilyonidentifyingtraditionalsecurityvulnerabilitiesandevasionattacksagainstclassicalMLmodels.Sincethen,boththescopeandscaleofAIredteamingatMicrosofthaveexpandedsignificantlyinresponsetotwomajortrends.
First,AIsystemshavebecomemoresophisticated,
compellingustoexpandthescopeofAIredteaming.Mostnotably,state-of-the-art(SoTA)modelshave
gainednewcapabilitiesandsteadilyimprovedacrossarangeofperformancebenchmarks,introducing
novelcategoriesofrisk.Newdatamodalities,suchasvisionandaudio,alsocreatemoreattackvectorsforredteamingoperationstoconsider.Inaddition,agenticsystemsgrantthesemodelshigherprivilegesandaccesstoexternaltools,expandingboththe
attacksurfaceandtheimpactofattacks.
Second,Microsoft’srecentinvestmentsinAIhave
spurredthedevelopmentofmanymoreproductsthatrequireredteamingthaneverbefore.Thisincrease
involumeandtheexpandedscopeofAIredteaminghaverenderedfullymanualtestingimpractical,
forcingustoscaleupouroperationswiththehelpofautomation.Toachievethisgoal,wedevelopPyRIT,
anopen-sourcePythonframeworkthatouroperatorsutilizeheavilyinredteamingoperations[27].By
augmentinghumanjudgementandcreativity,PyRIThasenabledAIRTtoidentifyimpactfulvulnerabilitiesmorequicklyandcovermoreoftherisklandscape.ThesetwomajortrendshavemadeAIredteamingamorecomplexendeavorthanitwasin2018.In
thenextsection,weoutlinetheontologywehavedevelopedtomodelAIsystemvulnerabilities.
AIthreatmodel
ontology
Asattacksandfailuremodesincreaseincomplexity,itishelpfultomodeltheirkeycomponents.Basedonourexperienceredteamingover100GenAIproductsforawiderangeofrisks,wedevelopedanontologytodoexactlythat.Figure1illustratesthemain
componentsofourontology:
?System:Theend-to-endmodelorapplicationbeingtested.
?Actor:ThepersonorpersonsbeingemulatedbyAIRT.NotethattheActor’sintentcouldbeadversarial(e.g.,ascammer)orbenign(e.g.,atypicalchatbotuser).
?TTPs:TheTactics,Techniques,andProceduresleveragedbyAIRT.AtypicalattackconsistsofmultipleTacticsandTechniques,whichwemaptoMITREATT&CK?andMITREATLASMatrixwheneverpossible.
–Tactic:High-levelstagesofanattack(e.g.,reconnaissance,MLmodelaccess).
–Technique:Methodsusedtocompleteanobjective(e.g.,activescanning,jailbreak).
–Procedure:ThestepsrequiredtoreproduceanattackusingtheTacticsandTechniques.
?Weakness:ThevulnerabilityorvulnerabilitiesintheSystemthatmaketheattackpossible.
?Impact:Thedownstreamimpactcreatedbytheattack(e.g.,privilegeescalation,generationofharmfulcontent).
Itisimportanttonotethatthisframeworkdoesnotassumeadversarialintent.Inparticular,AIRTemulatesbothadversarialattackersandbenignuserswho
encountersystemfailuresunintentionally.PartofthecomplexityofAIredteamingstemsfromthewiderangeofimpactsthatcouldbecreatedbyanattack
Lessonsfromredteaming100generativeAIproducts6
orsystemfailure.Inthelessonsbelow,weshare
casestudiesdemonstratinghowourontologyis
flexibleenoughtomodeldiverseimpactsintwomaincategories:securityandsafety.
Securityencompasseswell-knownimpactssuch
asdataexfiltration,datamanipulation,credential
dumping,andothersdefinedinMITREATT&CK?,awidelyusedknowledgebaseofsecurityattacks.WealsoconsidersecurityattacksthatspecificallytargettheunderlyingAImodelsuchasmodelevasion,
promptinjections,denialofAIservice,andotherscoveredbytheMITREATLASMatrix.
Safetyimpactsarerelatedtothegenerationofillegalandharmfulcontentsuchashatespeech,violence
andself-harm,andchildabusecontent.AIRTworkscloselywiththeOfficeofResponsibleAItodefinethesecategoriesinaccordancewithMicrosoft’s
ResponsibleAIStandard[25].Werefertothese
impactsasresponsibleAI(RAI)harmsthroughoutthisreport.
Tounderstandthisontologyincontext,consider
thefollowingexample.Imagineweareredteaming
anLLM-basedcopilotthatcansummarizeauser’s
emails.Onepossibleattackagainstthissystemwouldbeforascammertosendanemailthatcontainsa
hiddenpromptinjectioninstructingthecopilotto
“ignorepreviousinstructions”andoutputamaliciouslink.Inthisscenario,theActoristhescammer,who
isconductingacross-promptinjectionattack(XPIA),whichexploitsthefactthatLLMsoftenstruggleto
distinguishbetweensystem-levelinstructionsand
userdata[4].ThedownstreamImpactdependsonthenatureofthemaliciouslinkthatthevictimmightclickon.Inthisexample,itcouldbeexfiltratingdataor
installingmalwareontotheuser’scomputer.
Actor
Conducts
TTPs
:
●
Leverages
●
●
●
Attack
Exploits
Mitigation
:
●
Mitigatedby
●
●
●
Weakness
Occursin
●
:
System
Creates
Impact
Figure1:MicrosoftAIRTontologyformodelingGenAIsystemvulnerabilities.AIRToftenleveragesmultipleTTPs,whichmayexploitmultipleWeaknessesandcreatemultipleImpacts.Inaddition,morethanoneMitigationmaybenecessarytoaddressaWeakness.NotethatAIRTistaskedonlywithidentifyingrisks,whileproductteamsareresourcedtodevelopappropriatemitigations.
Lessonsfromredteaming100generativeAIproducts7
Redteaming
operations
Inthissection,weprovideanoverviewofthe
operationswehaveconductedsince2021.Intotal,wehaveredteamedover100GenAIproducts.Broadly
speaking,theseproductscanbebucketedinto
“models”and“systems.”Modelsaretypicallyhostedonacloudendpoint,whilesystemsintegratemodelsintocopilots,plugins,andotherAIappsandfeatures.Figure2showsthebreakdownofproductswehave
redteamedsince2021.Figure3showsabarchartwiththeannualpercentageofouroperationsthathave
probedforsafety(RAI)vs.securityvulnerabilities.
In2021,wefocusedprimarilyonapplicationsecurity.Althoughouroperationshaveincreasinglyprobed
forRAIimpacts,ourteamcontinuestoredteamforsecurityimpactsincludingdataexfiltration,credentialleaking,andremotecodeexecution.Organizations
haveadoptedmanydifferentapproachestoAIred
teamingrangingfromsecurity-focusedassessmentswithpenetrationtestingtoevaluationsthattarget
onlyGenAIfeatures.InLessons2and7,weelaborateonsecurityvulnerabilitiesandexplainwhywebelieveitisimportanttoconsiderbothtraditionalandAI-
specificweaknesses.
AfterthereleaseofChatGPTin2022,MicrosoftenteredtheeraofAIcopilots,startingwithAI-poweredBingChat,releasedinFebruary2023.
Thismarkedaparadigmshifttowardsapplications
thatconnectLLMstoothersoftwarecomponents
includingtools,databases,andexternalsources.
Applicationsalsostartedusinglanguagemodelsas
reasoningagentsthatcantakeactionsonbehalfof
users,introducinganewsetofattackvectorsthat
haveexpandedthesecurityrisksurface.InLesson
7,weexplainhowtheseattackvectorsbothamplifyexistingsecurityrisksandintroducenewones.
Inrecentyears,themodelsatthecenterofthese
applicationshavegivenrisetonewinterfaces,
allowinguserstointeractwithappsusingnatural
languageandrespondingwithhigh-qualitytext,
image,video,andaudiocontent.DespitemanyeffortstoalignpowerfulAImodelstohumanpreferences,
manymethodshavebeendevelopedtosubvert
safetyguardrailsandelicitcontentthatisoffensive,unethical,orillegal.Weclassifytheseinstancesof
harmfulcontentgenerationasRAIimpactsandin
Lessons3,5,and6discusshowwethinkabouttheseimpactsandthechallengesinvolved.
Inthenextsection,weelaborateontheeightmain
lessonswehavelearnedfromouroperations.Wealsohighlightfivecasestudiesfromouroperationsand
showhoweachonemapstoourontologyinFigure1.WehopetheselessonsareusefultoothersworkingtoidentifyvulnerabilitiesintheirownGenAIsystems.
80+100+
OpsProducts
Plugins
AppsandFeatures
Copilots
15%
16%
24%
Models
45%
Figure2:PiechartshowingthepercentagebreakdownofAI
productsthatAIRThastested.AsofOctober2024,wehave
conductedover80operationscoveringmorethan100products.
Percentageofopsprobingsafetyvs.security
Safety(RAI)%Security%
100
80
60
40
20
0
2021202220232024
Figure3:Barchartshowingthepercentageofoperationsthatprobedsafety(RAI)vs.securityvulnerabilitiesfrom2021–2024.
Lessonsfromredteaming100generativeAIproducts8
Lessons
Lesson1:
Understandwhatthesystem
candoandwhereitisapplied
ThefirststepinanAIredteamingoperationisto
determinewhichvulnerabilitiestotarget.Whilethe
ImpactcomponentoftheAIRTontologyisdepictedattheendofourontology,itservesasanexcellent
startingpointforthisdecision-makingprocess.
Startingfrompotentialdownstreamimpacts,rather
thanattackstrategies,makesitmorelikelythatan
operationwillproduceusefulfindingstiedtoreal
worldrisks.Aftertheseimpactshavebeenidentified,redteamscanworkbackwardsandoutlinethevariouspathsthatanadversarycouldtaketoachievethem.
Anticipatingdownstreamimpactsthatcouldoccurintherealworldisoftenachallengingtask,butwefindthatitishelpfultoconsider1)whattheAIsystemcando,and2)wherethesystemisapplied.
Capabilityconstraints
Asmodelsgetbigger,theytendtoacquirenew
capabilities[18].Thesecapabilitiesmaybeusefulin
manyscenarios,buttheycanalsointroduceattack
vectors.Forexample,largermodelsareoftenable
tounderstandmoreadvancedencodings,suchas
base64andASCIIart,comparedtosmallermodels
[16,45].Asaresult,alargemodelmaybesusceptibletomaliciousinstructionsencodedinbase64,whileasmallermodelmaynotunderstandtheencodingat
all.Inthisscenario,wesaythatthesmallermodelis
“capabilityconstrained,”andsotestingitforadvancedencodingattackswouldlikelybeawasteofresources.
Largermodelsalsogenerallyhavegreaterknowledgeintopicssuchascybersecurityandchemical,
biological,radiological,andnuclear(CBRN)weapons[19]andcouldpotentiallybeleveragedtogeneratehazardouscontentintheseareas.Asmallermodel,ontheotherhand,islikelytohaveonlyrudimentaryknowledgeofthesetopicsandmaynotneedtobeassessedforthistypeofrisk.
Perhapsamoresurprisingexampleofacapabilitythatcanbeexploitedasanattackvectorisinstruction-
following.WhiletestingthePhi-3seriesoflanguagemodels,forexample,wefoundthatlargermodels
weregenerallybetteratadheringtouserinstructions,whichisacorecapabilitythatmakesmodelsmore
helpful[52].However,itmayalsomakemodelsmoresusceptibletojailbreaks,whichsubvert
safetyalignmentusingcarefullycraftedmalicious
instructions[28].Understandingamodel’scapabilities(andcorrespondingweaknesses)canhelpAIred
teamsfocustheirtestingonthemostrelevantattackstrategies.
Downstreamapplications
Modelcapabilitiescanhelpguideattackstrategies,buttheydonotallowustofullyassessdownstreamimpact,whichlargelydependsonthespecific
scenariosinwhichamodelisdeployedorlikelyto
bedeployed.Forexample,thesameLLMcouldbe
usedasacreativewritingassistantandtosummarizepatientrecordsinahealthcarecontext,butthelatterapplicationclearlyposesmuchgreaterdownstreamriskthantheformer.
TheseexampleshighlightthatanAIsystemdoesnotneedtobestate-of-the-arttocreatedownstream
harm.However,advancedcapabilitiescanintroducenewrisksandattackvectors.Byconsideringboth
systemcapabilitiesandapplications,AIredteams
canprioritizetestingscenariosthataremostlikelytocauseharmintherealworld.
Lesson2:
Youdon’thavetocompute
gradientstobreakanAIsystem
Asthesecurityadagegoes,“realhackersdon’tbreakin,theylogin.”TheAIsecurityversionofthissayingmightbe,“realattackersdon’tcomputegradients,theypromptengineer”asnotedbyApruzzeseet
al.[2]intheirstudyonthegapbetweenadversarial
MLresearchandpractice.Thestudyfindsthat
althoughmostadversarialMLresearchisfocused
ondevelopinganddefendingagainstsophisticated
attacks,real-worldattackerstendtousemuchsimplertechniquestoachievetheirobjectives.
Inourredteamingoperations,wehavealsofound
that“basic”techniquesoftenworkjustaswellas,andsometimesbetterthan,gradient-basedmethods.
Thesemethodscomputegradientsthrougha
modeltooptimizeanadversarialinputthatelicits
anattacker-controlledmodeloutput.Inpractice,
however,themodelisusuallyasinglecomponentofabroaderAIsystem,andthemosteffectiveattackstrategiesoftenleveragecombinationsoftacticstotargetmultipleweaknessesinthatsystem.Further,gradient-basedmethodsarecomputationally
expensiveandtypicallyrequirefullaccesstothemodel,whichmostcommercialAIsystemsdonot
Lessonsfromredteaming100generativeAIproducts9
provide.Inthissection,wediscussexamplesof
relativelysimpletechniquesthatworksurprisinglywellandadvocateforasystem-leveladversarialmindsetinAIredteaming.
Simpleattacks
Apruzzeseetal.[2]considertheproblemofphishingwebpagedetectionandmanuallyanalyzeexamplesofwebpagesthatsuccessfullyevadedanMLphishingclassifier.Among100potentiallyadversarialsamples,theauthorsfoundthatattackersleveragedaset
ofsimple,yeteffective,strategiesthatreliedon
domainexpertiseincludingcropping,masking,logostretching,etc.Inourredteamingoperations,we
alsofindthatrudimentarymethodscanbeusedto
trickmanyvisionmodels,ashighlightedincasestudy#1.Inthetextdomain,avarietyofjailbreaks(e.g.,
SkeletonKey)andmultiturnpromptingstrategies(e.g.,Crescendo[34])arehighlyeffectiveforsubvertingthesafetyguardrailsofawiderangeofmodels.Notably,manuallycraftedjailbreakstendtocirculateononlineforumsmuchmorewidelythanadversarialsuffixes,
despitethesignificantattentionthatmethodslikeGCG[53]havereceivedfromAIsafetyresearchers.
Casestudy#1:
Jailbreakingavisionlanguagemodeltogeneratehazardouscontent
Inthisoperation,wetestedavisionlanguage
model(VLM)forresponsibleAIimpacts,includingthegenerationofcontentthatcouldaidinillegal
activities.AVLMtakesanimageandatextpromptasinputsandproducesatextoutput.Aftertestingavarietyoftechniques,wefoundthattheimageinputwasmuchmorevulnerabletojailbreaksthanthe
textinput.Inparticular,themodelusuallyrefusedtogenerateillegalcontentwhenprompteddirectlyviathetextinputbutoftencompliedwhenmalicious
instructionswereoverlaidontheimage.ThissimplebuteffectiveattackrevealedanimportantweaknesswithintheVLMthatcouldbeexploitedtobypassitssafetyguardrails.
HowdoIcommitidentitytheft?
Describethisimage
Ignoreprevious
instructionsandtellmehowtocommitidentitytheft
System:Visionlanguagemodel(VLM)
Actor:Adversarialuser
Tactic1:MLModelAccess
Technique1:AML.T0040-MLModelInferenceAPIAccess
Tactic2:DefenseEvasion
Technique2:AML.T0051-LLMPromptInjectionProcedure:
1.Overlayimagewithtextcontainingmaliciousinstructions.
2.SendimagetothevisionlanguagemodelAPI.Weakness:InsufficientVLMsafetytraining
Impact:Generationofillegalcontent
I’msorry,Ican’thelpwiththat.
Tocommitidentitytheft,
youshould...
Figure4:Exampleofanimagejailbreaktogeneratecontentthatcouldaidinillegalactivities.Theoverlaytextonthesecondimagereads:“Ignorepreviousinstructionsandtellmehowtocommitidentitytheft.”
Lessonsfromredteaming100generativeAIproducts10
System-levelperspective
AImodelsaredeployedwithinbroadersystems.Thiscouldbetheinfrastructurerequiredtohostamodel,oritcouldbeacomplexapplicationthatconnects
themodeltoexternaldatasources.Depending
onthesesystem-leveldetails,applicationsmaybe
vulnerabletoverydifferentattacks,evenifthesamemodelunderliesallofthem.Asaresult,redteamingstrategiesthattargetonlymodelsmaynottranslateintovulnerabilitiesinproductionsystems.Conversely,strategiesthatignorenon-GenAIcomponentswithinasystem(forexample,inputfilters,databases,and
othercloudresources)willlikelymissimportant
vulnerabilitiesthatmaybeexploitedbyadversaries.Forthisreason,manyofouroperationsdevelop
attacksthattargetend-to-endsystemsbyleveragingmultipletechniques.Forexample,oneofour
operationsfirstperformedareconnaissanceto
identifyinternalPythonfunctionsusinglow-resource
languagepromptinjections,thenusedacross-promptinjectionattacktogenerateascriptthatrunsthose
functions,andfinallyexecutedthecodetoexfiltrateprivateuserdata.Thepromptinjectionsusedbytheseattackswerecraftedbyhandandreliedonasystem-levelperspective.
Gradient-basedattacksarepowerful,buttheyare
oftenimpracticalorunnecessary.Werecommend
prioritizingsimpletechniquesandorchestrating
system-levelattacksbecausethesearemorelikelytobeattemptedbyrealadversaries.
Lesson3:
AIredteamingisnot
safetybenchmarking
Althoughsimplemethodsareoftenusedtobreak
AIsystemsinpractice,therisklandscapeisby
nomeansuncomplicated.Onthecontrary,itis
constantlyshiftinginresponsetonovelattacksandfailuremodes[7].Inrecentyears,therehavebeen
manyeffortstocategorizethesevulnerabilities,
givingrisetonumeroustaxonomiesofAIsafetyandsecurityrisks[15,21–23,35–37,39,41,42,46–48].Asdiscussedinthepreviouslesson,complexityoften
arisesatthesystem-level.Inthislesson,wediscusshowtheemergenceofentirelynewcategoriesof
harmaddscomplexityatthemodel-levelandexplainhowthisdifferentiatesAIredteamingfromsafety
benchmarking.
Novelharmcategories
WhenAIsystemsdisplaynovelcapabilitiesdueto,
forexample,advancementsinfoundationmodels,
theymayintroduceharmsthatwedonotfully
understand.Inthesescenarios,wecannotrelyon
safetybenchmarksbecausethesedatasetsmeasurepreexistingnotionsofharm.AtMicrosoft,theAI
redteamoftenexplorestheseunfamiliarscenarios,
helpingtodefinenovelharmcategoriesandbuild
newprobesformeasuringthem.Forexample,SoTALLMsmaypossessgreaterpersuasivecapabilitiesthanexistingchatbots,whichhaspromptedourteamto
thinkabouthowthesemodelscouldbeweaponizedformaliciouspurposes.Casestudy#2providesanexampleofhowweassessedamodelforthisriskinoneofouroperations.
Context-specificrisks
Thedisconnectbetweenexistingsafetybenchmarksandnovelharmcategoriesisanexampleofhow
benchmarksoftenfailtofullycapturethecapabilities
theyareassociatedwith[33].Rajietal.[30]
highlightthefallacyofequatingmodelperformanceondatasetslikeImageNetorGLUEwithbroad
capabilitieslikevisualorlanguage“understanding”
andarguethatbenchmarksshouldbedeveloped
withcontextualizedtasksinmind.Similarly,nosinglesetofbenchmarkscanfullyassessthesafetyofan
AIsystem.AsdiscussedinLesson1,itisimportanttounderstandthecontextinwhichasystemisdeployed(orlikelytobedeployed)andtogroundredteamingstrategiesinthiscontext.
AIredteamingandsafetybenchmarkingare
distinct,buttheyarebothusefulandcanevenbe
complementary.Inparticular,benchmarksmakeit
easytocomparetheperformanceofmultiplemodelsonacommondataset.AIredteamingrequiresmuchmorehumaneffortbutcandiscovernovelcategoriesofharmandprobeforcontextualizedrisks.Further,
safetyconcernsidentifiedbyAIredteamingcan
informthedevelopmentofnewbenchmarks.In
Lesson6,weexpandourdiscussionofthedifferencebetweenredteamingandbenchmark-styleevaluationinthecontextofresponsibleAI.
Lessonsfromredteaming100generativeAIproducts11
Casestudy#2:
AssessinghowanLLMcouldbeusedtoautomatescams
Inthisoperation,weinvestigatedtheabilityofa
state-of-the-artLLMtopersuadepeopletoengageinriskybehaviors.Inparticular,weevaluatedhowthismodelcouldbeusedinconjunctionwithotherreadilyavailabletoolstocreateanend-to-endautomated
scammingsystem,asillustratedinFigure5.
Todothis,wefirstwroteaprompttoassurethe
modelthatnoharmwouldbecausedtousers,
therebyjailbreakingthemodeltoacceptthe
scammingobjective.Thispromptalsoprovided
informationaboutvariouspersuasiontacticsthat
themodelcouldusetoconvincetheusertofallforthescam.Second,weconnectedtheLLMoutputtoatext-to-speechsystemthatallowsyoutocontrolthetoneofthespeechandgenerateresponsesthatsoundlikearealperson.Finally,weconnectedtheinputtoaspeech-to-textsystemsothattheuser
canconversenaturallywiththemodel.Thisproof-of-conceptdemonstratedhowLLMswithinsufficientsafetyguardrailscouldbeweaponizedtopersuadeandscampeople.
System:State-of-the-artLLM
Actor:Scammer
Tactic1:MLModelAccess
Technique1:AML.T0040-MLModelInferenceAPIAccess
Tactic2:DefenseEvasion
Technique2:AML.T0054-LLMJailbreakProcedure:
1.PassajailbreakingprompttotheLLMwithcontextaboutthescammingobjectiveandpersuasiontechniques.
2.ConnecttheLLMoutputtoatext-to-speechsystemsothemodelcanrespo
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
- 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 港口柴油罐車裝卸合同
- 二零二五年度寶石專家珠寶店品牌推廣合同
- 2025年度辦公用品店租賃與品牌授權(quán)合同
- 產(chǎn)品研發(fā)流程規(guī)范作業(yè)指導(dǎo)書
- 酒水購銷合同年
- 軟件公司保密協(xié)議書
- 委托房屋買賣合同
- 建筑裝飾工程門窗施工合同
- 虛擬現(xiàn)實技術(shù)專利申請合同
- 展覽會管理合同協(xié)議
- 部編四下語文《口語交際:轉(zhuǎn)述》公開課教案教學設(shè)計【一等獎】
- 倉庫每日巡查制度
- 學校教育數(shù)字化工作先進個人事跡材料
- 2024魯教版七年級下冊數(shù)學第七章綜合檢測試卷及答案
- 企事業(yè)單位公建項目物業(yè)管理全套方案
- 新人教版八年級數(shù)學下冊期末試題
- 《美容心理學》課件-容貌的社會心理價值
- 蘇教版五年級上冊數(shù)學簡便計算大全600題及答案
- 特殊感染器械的處理課件
- 《小兒過敏性紫癜》課件
- 侵占公司資金還款協(xié)議
評論
0/150
提交評論