Portland State University
PDXScholar
Business Faculty Publications and Presentations
The School of Business
8-2021
Unboxing the Algorithm: A Process Model of an Algorithmic Solution
Marta Stelmaszak Rosa
Portland State University, stmarta@
Citation Details
Stelmaszak, M. (2021) Unboxing the Algorithm: A Process Model of an Algorithmic Solution. Americas Conference on Information Systems 2021, 9-13 August 2021.
This Conference Proceeding is brought to you for free and open access. It has been accepted for inclusion in Business Faculty Publications and Presentations by an authorized administrator of PDXScholar. Please contact us if we can make this document more accessible: pdxscholar@.
Unboxing the Algorithm: A Process Model
Twenty-Seventh Americas Conference on Information Systems, Montreal, 2021
Unboxing the Algorithm: A Process Model of an Algorithmic Solution
Completed Research
Marta Stelmaszak
Portland State University
stmarta@
Abstract
With the explosion of data, analytics and artificial intelligence, information systems research focuses on the use, management and consequences of algorithms. Thus far, only a handful of papers offer insights into how algorithmic solutions work. To address this gap, we studied the code making up 45 public data science Jupyter notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a data science platform K. We synthesized a process model of an algorithmic solution: preparing the environment, reading in data, cleaning data, exploratory data analysis, pre-processing the dataset, building and training the model, and testing and validating the model. Unboxing the algorithm and investigating the process offers a more fine-tuned understanding and language to better conceptualize the use, management and consequences of algorithmic solutions. It also provides a scaffolding for research into the development of algorithmic solutions, highlighting their variability, experimentation and data scientist decisions.
Keywords
Algorithms, algorithmic solutions, data science, information systems development, process model
Introduction
Algorithms have, without a doubt, attracted research attention across a number of fields, from media studies, through sociology, to computer science. Management and information systems (IS) researchers study algorithmic solutions primarily in terms of their use, management and consequences for individuals in work contexts, in organizations, and in the wider society (Galliers et al. 2017; Markus 2017; Newell and Marabelli 2015). However, this research can greatly benefit from an improved understanding of how algorithmic solutions are developed, and thus there have been calls to focus more on theorizing their development (van den Broek et al. 2021). Thus far, only a handful of papers in IS offer insights into how algorithmic solutions work, which is an essential link between understanding their use and their development. Against this background, this study aims to answer a simple question: what is the process of making an algorithmic solution work?
To uncover the building blocks and propose a process model, we studied 45 public data science Jupyter notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a popular data science and machine learning platform K (Dissanayake et al. 2015; Mangal and Kumar 2016). Referring to a common problem faced by many companies and often tackled by algorithmic solutions, the credit card dataset attracted over 200 notebooks with code and comments describing attempts to best predict customer churn. We selected 45 of the best-regarded notebooks, downloaded them and coded them using a grounded theory approach (Charmaz 2006; Glaser and Strauss 1967; Urquhart and Fernandez 2006). We then grouped the themes to identify the elements that made up each proposed algorithmic solution and distilled a process model of how they were developed.
Based on our findings, we propose a process model of making an algorithmic solution work encompassing: preparing the environment, reading in data, cleaning data, exploratory data analysis, pre-processing the dataset, building and training the model, and testing and validating the model. We contribute to information systems and management literature by developing a process model of an algorithmic solution that not only offers a more fine-tuned language to investigate the use, management and consequences of algorithmic solutions on individuals, organizations and societies, but also enables further study of the design and development of such solutions from a socio-technical perspective.
Making Algorithmic Solutions Work
Recent technological (processing capabilities, big data, machine learning), societal (use of smartphones, attitudes towards data, social media) and organizational (platformization, networks) developments contributed to the growth in use of various algorithms (Baptista et al. 2017; Berente et al. 2019). IS research in the socio-technical tradition has thus focused on the study of the use, management and consequences of algorithms on individual, organizational and societal levels (Galliers et al. 2017; Markus 2017; Newell and Marabelli 2015). However, far less attention has been paid so far to the understanding of how algorithms and algorithmic solutions based on them are developed (van den Broek et al. 2021). First papers begin to uncover how data scientists and subject matter experts need to work together in the development process (van den Broek et al. 2021), how the practices of data scientists in the banking industry rely on both subjectivity and objectivity in the production of information (Joshi 2020), and how data scientists engage in the practices of knowledge hiding (Ghasemaghaei and Turel 2021). In other words, while focusing predominantly on what happens after the algorithms are put to work, current literature offers little insight into how algorithms are made to work, that is, what steps need to be in place for an algorithmic solution to work effectively. Such understanding is essential because the process of making an algorithmic solution work, as we show below, determines what kinds of insights and predictions it offers, thus influencing decisions.
Most researchers who in broad strokes describe what goes into making algorithmic solutions work in their papers refer to certain aspects with varying consistency: the fact that algorithms process data (Balasubramanian and Ye 2021; van den Broek et al. 2020; Galliers et al. 2017; Gregory et al. 2020; Grønsund and Aanestad 2020; Lebovitz 2020; Lycett 2013; Newell and Marabelli 2015; Pachidi et al. 2021; Shrestha et al. 2019) in an automated or preprogrammed way (Galliers et al. 2017; Grønsund and Aanestad 2020; Günther et al. 2017; Shrestha et al. 2019) to learn models (Balasubramanian and Ye 2021; Li et al. 2019; Shrestha et al. 2019) leading to new insights (Günther et al. 2017; Günther and Joshi 2020; Pachidi et al. 2021), decisions (Balasubramanian and Ye 2021; van den Broek et al. 2020; Galliers et al. 2017; Newell and Marabelli 2015) or predictions (Lebovitz 2020; Li et al. 2019; Shrestha et al. 2019). This offers a punctuated and incomplete picture of the elements involved in developing algorithms that can be subsequently used in business settings.
A handful of papers offer insights into the essential elements of what happens inside algorithmic solutions. Pachidi et al. (2021) provide a detailed description, covering various elements that are at play in a predictive model:
“The model combined a number of internal and external data sources, such as time series of customer transactions, Nielsen market data, Gartner ICT spending predictions, financial data, and usage data. The output of the model was represented in a spreadsheet format that contained a list of all medium-sized customers and predictions regarding potential sales opportunities. The CLM model allocated customers to different customer segments (A, B, C, D) based on their historical and predicted sales with TelCo. For each TelCo product line (e.g., business telephone systems, mobile phone packages, fixed line setups), the CLM model assigned a position in the customer sales life cycle (inform, specify, sell, maintain), each of which entailed a different contact strategy. Thus, the model output consisted of a ranking of opportunities, with a prioritized action list for account managers.”
Grønsund and Aanestad (2020, p. 7) are similarly detailed:
“The algorithm-supported analysis system was designed to automate both data acquisition and the processing of data for subsequent analysis. Acquisition of data was automated by the system pulling streams of data on ship activity from the satellite-AIS data provider, along with additional data such as vessel descriptions and geospatial data, into a Hadoop-based data warehouse repository. Here the data were extracted and consolidated, then classified using rule-based NLP (Natural Language Processing) classification, and finally presented in BI tools that allowed human interpretation of the output.”
While the descriptions both point to obtaining, compiling and processing of data, further analysis and classification, they reveal differences in how the solutions work, and do not offer a complete picture. Taking a more general view, Orlikowski and Scott define an algorithm as “a set of step-by-step instructions to achieve a desired result in a finite number of moves” (2015, p. 210). Acknowledging this more traditional definition of an algorithm (a program containing a fixed sequence of instructions executed until a solution is reached) rooted in computer science (Hopcroft and Ullman 1983), Faraj et al. (2018) ‘update’ and broaden the scope of this definition by conceptualizing learning algorithms as “an emergent family of technologies that build on machine learning, computation, and statistical techniques, as well as rely on large data sets to generate responses, classifications, or dynamic predictions that resemble those of a knowledge worker” (p. 62). A similar definition of artificial intelligence algorithms is put forward by Tarafdar et al. (2020, p. 1): “We define AI algorithms as those that extract insights and knowledge from big data sources; computational and statistical techniques such as machine learning (ML) and deep learning embedded in such algorithms, aim to ‘teach’ computers the ability to detect patterns in big data”.
While these definitions offer a good starting point and an initial overview of the elements in the process of making algorithmic solutions work, they are partial and divergent in their focus. These differences in the definition, understanding, scope and scale of the steps and elements required to make algorithms work hamper the development of the understanding of the use, management and consequences of algorithms, and at the same time make uncovering their development more difficult. For IS research to systematically progress in this area it is thus fundamental to ask: what is the process of making an algorithmic solution work?
Research Setting and Methods
To answer this question, we studied 45 public data science notebooks containing algorithmic solutions developed to predict customer churn in a credit card dataset on a popular data science and machine learning platform K. Below we describe the research setting, as well as data collection and analysis methods.
Research Setting
K is a popular platform for data scientists and machine learning engineers where they can develop and improve their skills, as well as participate in corporate-sponsored competitions by addressing a variety of problems related to datasets published on the platform. K, part of Alphabet Inc, allows users to upload datasets, set specific tasks for them and create interactive Jupyter notebooks where users can develop their algorithmic solutions. K was selected as a setting because of its public availability and openness in sharing notebooks, which allows unprecedented access to the inner workings of algorithmic solutions. Others have used K for research purposes as well (Dissanayake et al. 2015; Mangal and Kumar 2016).
The dataset we selected for this study is a well-regarded and popular dataset with high usability. It contains the details of around 10,000 credit card customers of a bank, whereby a portion of customers churned. The goal is to identify, based on 18 variables such as age, salary, credit card limit and similar, what makes a customer churn (give up a credit card) to be able to predict customers at risk of churning in the future, as well as to identify the variables that are most predictive of the risk of churn (“Kaggle.Com” 2021). When the dataset was investigated for the purposes of this research project, there were around 210 notebooks submitted that contained algorithmic solutions pertaining to this dataset, with constant daily activity in existing notebooks and new notebooks being added.
We selected an open and public dataset rather than a competition because the majority of notebooks submitted for competitions are private and thus visible only to sponsor companies, and competitions are usually very specific and limit the number of potential algorithmic solutions applied. In contrast, public notebooks allow good access to a variety of notebooks containing fairly unrestricted solutions and allow for much more experimentation on the part of users. From the many datasets available on K, we selected the credit card customers dataset because it is related to a common problem that many companies and businesses face, and it is a problem that is often tackled by developing algorithmic solutions, thus it is a good representative sample of what researchers in information systems and management would consider of interest.
Data Collection
In January and February 2021, we collected 57 Jupyter notebooks that were created using the credit card customer dataset in Python as the set programming language. The notebooks were arranged from the ‘hottest’ (a measure used on K to define notebooks with most activity, edits and highest votes by the community, Kaggle.Com 2021) to the least hot, and thus those that we collected were considered among the ‘hottest’ at the time. We decided to select the ‘hottest’ notebooks as these were assessed as high quality by the community, and thus were likely to contain well-developed algorithmic solutions. We discarded notebooks in R to eliminate differences in programming languages, and notebooks that contained only partial solutions, for example those that only analyzed data without building actual models. We ended up with 45 suitable notebooks. Using a feature available on K, we downloaded all of the selected notebooks and converted them to PDF documents to analyze them in NVivo.
Data Analysis
Since our study is rooted in grounded theory (Charmaz 2006; Glaser and Strauss 1967; Urquhart and Fernandez 2006), we proceeded by inductively coding the notebooks to identify the different elements of code they contained by what these elements of code did. We coded each segment of code in each notebook to identify its function. Verbal descriptions of data scientists sometimes provided additional information as to the role of each code segment, so these were coded too. However, the descriptions were mostly useful in the second stage of data analysis, where we grouped the codes we obtained into higher-level elements of the process, as they explained the flow of the process. For example, in the notebooks data scientists would sometimes indicate they were proceeding to exploratory data analysis, and we used these comments to group elements of code identified under the element ‘Exploratory Data Analysis’.
Because of the inductive nature of our study, we oscillated between data analysis and further data collection. After coding the first 30 notebooks, we began to group the codes to start building the model. We then proceeded with coding and analyzing notebooks one by one to supplement and verify the model that was emerging from our analysis. When we reached notebook number 35, the subsequent 10 notebooks did not add any new codes to the codebook and at this point we decided to stop coding and analyzing the notebooks as we reached the point of saturation.
Unboxing the Algorithm
In this section, we present the elements of the process of making an algorithmic solution work that we identified in the data. Each element is discussed in turn by showing what kinds of operations were performed in every element.
Preparing the Environment
Notebooks begin with setting the environment in which the development of the algorithmic solution takes place: programming language, acceleration and connection to the internet. The notebooks we observed were all set up in a Python 3 environment, which “comes with many helpful analytics libraries installed” (Notebook 002) and allows users to write up to 20GB to the working directory. Notebooks give the possibility to turn on an accelerator, such as a GPU, for faster processing, and to connect to the internet for access to external files. In some notebooks, data scientists use verbal comments to identify and restate the problem.
After this initial setup, various necessary libraries are imported, that is, pre-packaged functions designed for specific purposes that can be deployed by data scientists without the need to code such functions from scratch. Invariably, the notebooks featured “numpy” (Notebook 005), a Python library for linear algebra, and “pandas” (Notebook 007), which allows for data processing and, for example, reading in CSV files, among others. These two libraries are essential to develop the algorithmic solution. Other libraries imported include data visualization packages, such as “seaborn” or “matplotlib” (Notebook 029), which are fairly standard and popular libraries for this purpose. In some notebooks, all required packages are imported in the beginning of the notebook, including “sklearn” and “keras” (Notebook 014) that are used for building models, while other notebooks import additional libraries as and when needed. Libraries are imported with simple code: “import numpy as np” (Notebook 001), for example. Importing libraries is a standard procedure and there are no substantial comments regarding this step. A variety of widely used libraries exists for developing algorithmic solutions, and they encapsulate and abstract away the complexity behind tasks such as training a specific model, as explained below.
Reading in Data
The next element in the process in an algorithmic solution is to read in the required data. The first step here, quite logically, includes loading data in. Because the dataset that the notebooks use is uploaded to Kaggle, it can be attached to each notebook with a simple search within the interface, and then imported by executing a command from the “pandas” library, “read_csv” (Notebook 001).
Inspecting the data follows, usually through function “head”, displaying the first five (by default) rows of the dataset and corresponding columns with column headers, and sometimes function “shape” displaying the dimensions of the dataset (number of rows and number of columns) as well as function “columns”, giving the names of columns in the dataset. In just one notebook, we observed explicitly looking for duplicate entries in the dataset. Commands to perform these functions are pre-packaged and take forms of “df.head()”, “df.shape” or “df.columns” (Notebook 003). This stage of the process also involves checking data types present in the dataset, performed by using functions “info” or “dtypes” that indicate which columns contain integer (whole numbers), float (fractions with decimal points) or object (text or mixed numeric and non-numeric values) data type. This is important as most algorithmic solutions work only with numerical values. As part of reading in data, simple descriptive statistics of the data are obtained through function “describe”, resulting in displaying the number of rows, mean, standard deviation, minimum value, quartiles, and maximum value for each column.
Conducting these steps is essential to load the dataset and obtain basic information about the data needed to confirm that the data is loaded correctly, contains the expected columns and rows, and to gain initial familiarity with the dataset.
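The loading and inspection steps described above can be sketched as follows. This is only an illustration, not code taken from any of the notebooks: the small inline CSV stands in for the real file, and all column names other than CLIENTNUM, as well as all values, are invented for the example.

```python
import io

import pandas as pd

# Hypothetical stand-in for the credit card CSV; the values and most
# column names are invented for illustration only.
csv_text = """CLIENTNUM,Customer_Age,Gender,Credit_Limit,Attrition_Flag
1001,45,M,12691.0,Existing Customer
1002,49,F,8256.0,Existing Customer
1003,51,M,3418.0,Attrited Customer
1004,40,F,3313.0,Existing Customer
"""

# The notebooks pass a file path to read_csv; a text buffer works the same way.
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())      # first five rows by default
print(df.shape)       # (number of rows, number of columns)
print(df.columns)     # column names
print(df.dtypes)      # data type of each column
print(df.describe())  # count, mean, std, min, quartiles, max (numeric columns)

# Explicit check for duplicate rows, which the paper observed in one notebook.
n_duplicates = int(df.duplicated().sum())
print(n_duplicates)
```

In a notebook, each of these calls would typically sit in its own cell so that its output is displayed directly beneath it.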
Cleaning Data
After reading in the dataset, data is cleaned to prepare it for further processing. This is essential before any analysis can take place. Steps at this stage tend to be taken in various orders across the notebooks, and are reported here in no particular order.
Missing values are identified and dealt with: that is, NULL values in the dataset have to be resolved before any analysis can take place. This is done by using the function “isnull”, listing all columns with the number of missing values (Notebook 001). The customer churn dataset contained no null values, so in this case there was no need to deploy solutions to solve this problem. Missing values have to be resolved as the majority of algorithms cannot deal with datasets containing missing values. One of the ways to solve this problem that is presented in the notebooks is the method of imputation, that is replacing the missing or null value with an existing value from the dataset. In the solution proposed in the notebook this is done based on the nearest neighbor of the missing value, but since no missing values were detected, the solution is not implemented.
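A minimal sketch of counting missing values and imputing them from the nearest neighbor is shown below. The paper does not say which function the notebook proposed, so the use of scikit-learn's KNNImputer is an assumption, and the toy data frame, its column names and its values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Hypothetical toy data with two missing entries (invented values).
df = pd.DataFrame({
    "age": [45.0, 49.0, np.nan, 40.0],
    "months_on_book": [39.0, 44.0, 36.0, 34.0],
    "credit_limit": [12691.0, 8256.0, 3418.0, np.nan],
})

# Number of missing values per column, as obtained with "isnull".
missing_per_column = df.isnull().sum()
print(missing_per_column)

# Assumption: KNNImputer stands in for the notebook's nearest-neighbor method.
# With n_neighbors=1, each missing entry is copied from the single closest row
# (by NaN-aware Euclidean distance) that has a value in that column.
imputer = KNNImputer(n_neighbors=1)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```

With a larger n_neighbors, KNNImputer averages the neighbors' values, so the imputed entry is no longer necessarily a value that already exists in the dataset.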
In the notebooks, we found that columns are sometimes renamed if their names are not intuitive enough or simply too long. Certain columns containing variables that are not needed for the analysis are removed. For example, the customer churn dataset contains two columns with Naïve Bayes Classifier by default, and the author of the dataset suggests removing these columns before proceeding with analysis. At different points in data cleaning, exploratory data analysis or pre-processing the dataset, various columns are also removed if they are not contributing to the model (for example, removing customer ID: “data = data.drop(columns=[‘CLIENTNUM’])”, Notebook 015). Outliers can also be removed from the data using a common statistical measure, the z-score, which indicates how many standard deviations a given data point lies from the mean. In the only notebook we observed that removed outliers, this resulted in removing 810 rows from the dataset.
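The column removal and z-score filtering described above can be illustrated as follows. The data are synthetic, the column name credit_limit is hypothetical, and the cut-off of 3 standard deviations is an assumed convention; the paper does not report the threshold used in the notebook.

```python
import numpy as np
import pandas as pd

# Synthetic data: 199 ordinary values plus one extreme value (all invented).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "CLIENTNUM": np.arange(200),
    "credit_limit": np.append(rng.normal(5000, 500, 199), 50000.0),
})

# Drop an identifier column that carries no predictive information.
data = data.drop(columns=["CLIENTNUM"])

# z-score: distance from the mean measured in standard deviations.
z = (data["credit_limit"] - data["credit_limit"].mean()) / data["credit_limit"].std()

# Assumed threshold: keep rows within 3 standard deviations of the mean.
cleaned = data[z.abs() < 3]
print(len(data), len(cleaned))
```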
All notebooks we studied transform data types as part of cleaning data. This step, sometimes referred to as feature engineering, is required when the dataset contains object data types, which are categorical variables typical in many datasets, such as marital status, level of education or gender. These data types have to be transformed into numerical variables in order to be analyzed. This is conducted by using pre-existing functions to encode these variables as integers (e.g. primary education as 1, secondary as 2, tertiary as 3) or using popular one-hot encoding where there is no natural ordinal relationship between categories and dummy variables are created (e.g. male is 0, female is 1). Cleaned data is an essential element of any algorithmic solution, as without the steps taken in this element, data either results in erroneous analysis and model training, or simply cannot be used to train models.
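The two encoding approaches can be sketched with pandas alone. The data frame, its column names and its category labels are hypothetical, and the notebooks may well have used other pre-packaged encoders; this is just one way to obtain the same kind of result.

```python
import pandas as pd

# Hypothetical categorical data (invented values).
df = pd.DataFrame({
    "education": ["primary", "tertiary", "secondary", "primary"],
    "gender": ["M", "F", "F", "M"],
    "marital_status": ["Married", "Single", "Divorced", "Single"],
})

# Ordinal encoding: categories with a natural order become integers.
education_order = {"primary": 1, "secondary": 2, "tertiary": 3}
df["education"] = df["education"].map(education_order)

# Binary dummy variable: male is 0, female is 1.
df["gender"] = (df["gender"] == "F").astype(int)

# One-hot encoding: one 0/1 column per category, with no implied order.
df = pd.get_dummies(df, columns=["marital_status"], dtype=int)
print(df)
```

After these steps every column is numeric, which is the precondition for model training mentioned above.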
Exploratory Data Analysis
The next step in the algorithmic solution process is exploratory data analysis, whereby actions are taken to learn about the relationship between the dependent variable of interest (here: customer churn or attrition) and independent variables that may help build the predictive model. This step is essential to uncover what model will be the most appropriate for the dataset and which variables can be potentially of interest.
The first step is to identify the dependent variable (a trivial matter in the given dataset), and to analyze independent variables. This is very often performed by visualizing them independently, in relation to each other, or in relation to the dependent variable. In most cases, such visualizations were implemented using functions from visualization libraries, such as “seaborn”, “matplotlib” or rarely “plotly”. Visualizing data is the part that takes up the most code in nearly all notebooks we analyzed. Various visualizations are produced, such as box plots, pie charts and histograms, in order to help identify which variables may be useful in building the model. Visualizations are often accompanied by comments such as “Females are slightly more likely to churn with 17% compared to males with 15%, we’ll convert this feature to 1-0” (Notebook 013). Some notebooks contain more comprehensive comments on the learnings from visualizations.
The next step in exploratory data analysis is to identify correlations between variables. Identifying correlations is an important step in exploratory data analysis, as from this decisions can be made as to which features to include in pre-processing the dataset for model building, as described below. For example, Notebook 022 based on the identification of correlations decides to “# Drop some features which have less than 0.01 correlation and greater than -0.01 correlation”. Exploratory data analysis is a required step of building an algorithmic solution as it provides the necessary insight into the dataset for the purposes of model building. It is at this stage that the importance of variables with respect to the target variable is assessed.
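The correlation-based feature selection quoted from Notebook 022 could look roughly like the sketch below. The data frame is a hypothetical toy example whose columns and values are invented; only the 0.01 threshold comes from the notebook comment.

```python
import pandas as pd

# Hypothetical toy data; "churn" is the target (1 = churned).
df = pd.DataFrame({
    "churn": [0, 0, 0, 0, 1, 1, 1, 1],
    "transactions": [60, 55, 62, 58, 30, 35, 28, 33],
    "age": [40, 45, 50, 55, 42, 47, 52, 57],
    "unrelated": [1, 2, 2, 1, 1, 2, 2, 1],
})

# Pearson correlation of every feature with the target variable.
corr_with_target = df.corr()["churn"].drop("churn")
print(corr_with_target)

# Features whose correlation lies between -0.01 and 0.01 are dropped.
weak = corr_with_target[corr_with_target.abs() < 0.01].index
reduced = df.drop(columns=weak)
print(list(reduced.columns))
```

Here only the column named unrelated is removed, because its mean is identical for churned and retained customers and its correlation with the target is therefore zero.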
Pre-processing the Dataset
The following step in the process is to pre-process the dataset, which involves preparing the dataset according to the requirements of model building. First, data need to be scaled, which may involve actual scaling, that is changing the range of variables to a common range, e.g. between 0 and 1, or normalizing the variables following a normal distribution. Scaling is performed to ensure that no variable is interpreted as more predictive than it actually is just because its numerical values are on a different scale from other variables. Scaling is routinely performed using standard pre-packaged functions, such as “StandardScaler” from the popular “sklearn” library (Notebook 026).
The dataset should be resampled if it is not balanced, that is if one category is present much more frequently than another. In the case of the dataset investigated, customers who attrited occurred much less frequently, as identified in exploratory data analysis, so resampling was required. This is usually done by oversampling from the group of attrited customers, most frequently using a pre-packaged function ‘SMOTE’ (Synthetic Minority Oversampling Technique) which creates additional data poi