Benchmarking SAT Solvers for Bounded Model Checking

icon

15

pages

icon

English

icon

Documents

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

icon

15

pages

icon

English

icon

Documents

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

Benchmarking SAT Solvers for Bounded
Model Checking
Emmanuel Zarpas
IBM Haifa Research Laboratory,
zarpas@il.ibm.com
Abstract. Modern SAT solvers are highly dependent on heuristics.
Therefore, benchmarking is of prime importance in evaluating the per-
formances of different solvers. However, relevant benchmarking is not
necessarily straightforward. We present our experiments using the IBM
CNF Benchmark on several SAT solvers. Using the results, we attempt
to define guidelines for a relevant benchmarking methodology, using SAT
solvers for real life BMC applications.
1 Introduction
Over the past decade, formal verification via model checking has evolved from a
theoretical concept to a production-level technique. It is being actively used in
chip design projects across the industry, where formal verification engineers can
now tackle the verification of large industrial hardware designs. Bounded Model
Checking (BMC) [1] has recently enabled the verification of problems that used
to be beyond the reach of formal verification. However, BMC performance relies
heavily on the performance of the underlying SAT solver.
We conducted several experiments using the IBM CNF benchmark [12, 11]
to evaluate SAT tools. Often, we found discrepancies between our results and
the experimental results found in the literature. For example, our results are not
necessarily in line with the results of the SAT03 contest [4]. The main goal of this
paper is to outline a methodology for benchmarking SAT ...
Voir icon arrow

Publié par

Nombre de lectures

42

Langue

English

BenchmarkingSATSolversforBoundedModelCheckingEmmanuelZarpasIBMHaifaResearchLaboratory,azprsai@.lbi.mocmAbstract.ModernSATsolversarehighlydependentonheuristics.Therefore,benchmarkingisofprimeimportanceinevaluatingtheper-formancesofdifferentsolvers.However,relevantbenchmarkingisnotnecessarilystraightforward.WepresentourexperimentsusingtheIBMCNFBenchmarkonseveralSATsolvers.Usingtheresults,weattempttodefineguidelinesforarelevantbenchmarkingmethodology,usingSATsolversforreallifeBMCapplications.1IntroductionOverthepastdecade,formalverificationviamodelcheckinghasevolvedfromatheoreticalconcepttoaproduction-leveltechnique.Itisbeingactivelyusedinchipdesignprojectsacrosstheindustry,whereformalverificationengineerscannowtackletheverificationoflargeindustrialhardwaredesigns.BoundedModelChecking(BMC)[1]hasrecentlyenabledtheverificationofproblemsthatusedtobebeyondthereachofformalverification.However,BMCperformancereliesheavilyontheperformanceoftheunderlyingSATsolver.WeconductedseveralexperimentsusingtheIBMCNFbenchmark[12,11]toevaluateSATtools.Often,wefounddiscrepanciesbetweenourresultsandtheexperimentalresultsfoundintheliterature.Forexample,ourresultsarenotnecessarilyinlinewiththeresultsoftheSAT03contest[4].ThemaingoalofthispaperistooutlineamethodologyforbenchmarkingSATsolversonBoundedModelCheckingproblems.Thepaperisorganizedasfollows:InSection2,wepresentexperimentalresultsforsomefamousSATsolvers.Section3presentsanewversionoftheIBMBenchmarkanddiscussescorrelationoftheCNFcharacteristicsandsolversperformance.Section4containsadiscussiononSATbenchmarkingforBMC,wherewetrytolearnfromourexperimentalresultsinlightofotherexperimentssuchastheSATcontests.Section5concludesthepaper.2ExperimentswithIBM2002Benchmark:SATSolverComparison[O7]u,rzeCxhpaeriImIe(nztCshcaom2p0a0re4.d5.f1o3u)r[f3a],mBoeurskSMAiTn5s6o1lv[e2r]sa:nzdChSiaegeI(v2400[81]..2.W17evuesresdiotnh)eF.BacchusandT.Walsh(Eds.):SAT2005,LNCS3569,pp.340–354,2005.cSpringer-VerlagBerlinHeidelberg2005
BenchmarkingSATSolversforBoundedModelChecking3412002versionoftheIBMCNFBenchmark[12],whoseCNFsaregeneratedthroughBMCfromtheIBMFormalVerificationBenchmarkLibrary[11].Thislibraryisacollectionofmodelsfromreal-lifeindustrialhardwareverificationprojects.Foreachmodel,weusedtheCNFformulas(SATdat.k.cnf)forboundwithk=1,10,15,20,25,30,35,40,45,50.Thetime-outwassetat10,000secondsonaworkstationwith867841XIntel(R)Xeon(TM)CPU2.40GHz,with512KBfirstlevelcacheand2.5GBphysicalmemory.Weusedthedefaultconfigurationforeachengine.2.1zChaIvs.BerkMin561WeranzChaffI(2001.21.17version)andBerkMin561onthe2002versionoftheIBMCNFBenchmark.Table1isasummaryoftheresultsobtained.MoreTable1.Theresultsaredisplayedinsecondsandincludethetimeout10,000secondszChaffBerkMin561Totaltime(10000sectime-out)344765414094First(#ofCNFwheretheengineisthefastest)298131#ofunsolvedproblems(timeoutreached)2530+(#ofCNFwheretheengineisthefastestbymorethanaminuteand20%)6732Firstbymodel(#ofmodelwheretheengineisthefastest)2818completeexperimentalresultsarepresentedinTable8,whereforeachmodelinthetable,wepresentthesumoftheresultsofSATdat.k.cnfwithk=1,10,15,20,25,30,35,40,45,50(i.e.,theBMCtranslationofeachmodelforthedifferentk).Formoredetails,seethecompleteresultsin[13].BecauseSATsolversdonotalwaysbehaveinhomogeneousmanner(Cf.detailedresultsin[13]ortable2),thecompleteresultsanalysisshouldnotbedisregarded.TheresultsshowthatzChaffandBerkMin561achievedcloseresults.OnsomeCNFs,zChaffrunsfaster,andonothers,BerkMin561runsfaster.Inmostcases,thedifferencesintheirperformanceisnotverysignificant,thehighestspeedisnotfasterbymorethanoneminuteor20%.However,whilezChaffseemstoperformslightlybetteroverall,BerkMin561getsbetterresultsthanzChaffontheUNSATCNFs.Fromtheseresults,itisnotpossibletoconcludethatBerkMin561performsbetterthanzChaff.ThisisnotconsistentwithwhatcanbereadintheliteratureorwiththefinalresultsoftheSAT03contest(Cf.section2.4).ThisisnotaverysatisfyingconclusionandthepreviousmetricsdonottellusinasimplewayhowmuchfasterzChaffisthanBerkmin5612.Weneedsome12BerkMin561wassecondintheSAT03contestindustrialbenchmarkscategoryintheSAT03contest;thewinnerForkliftisnotpubliclyavailable.Itisveryimportanttobeabletogiveclearandconcisevaluesforresultsdissemi-nation.
342E.ZarpasTable2.TheresultsaredisplayedinsecondsfortheCNFsfrom18rulemodel.Thetime-outwassetto10000seconds.zChaffperformstheworstfork=15,20,35;Berk-Min561performstheworstfork=30,40,50;Siegev4hasthebestperformance,exceptfork=15,2518ruleCNFResultzChaBerkMinSiegeSATdat.k1.cnfunsat000SATdat.k10.cnfunsat110SATdat.k15.cnfunsat1345SATdat.k20.cnfunsat654741SATdat.k25.cnfunsat951109251SATdat.k30.cnfsat700755557SATdat.k35.cnfsattime-out22301730SATdat.k40.cnfsat55108060247SATdat.k45.cnfsattime-outtime-out959SATdat.k50.cnfsat5010time-out1370additionalmetricsinordertocomparetheseSATsolvers.Thearithmeticmeanofthespeedupseemsirrelevant.Let’ssaywewanttocomparetheperformancesofsolversAandB.IfthespeedupofAvs.Bis10ontest1and0.0001ontest2,thearithmeticmeanofthespeedup(Avs.B)wouldbe5andthemeanofthespeedup(Bvs.A)wouldbe5000.Thisisclearlynotrelevant.Theglobalspeedup(TotaltimeA/TotalTimeB)isgoodtohavebutthe“longruns”havefarmoreweightthantheshortones.Themedianismuchmoreinteresting.howeveritdoesnottakeintoaccounttheextremevalues(e.g,ifontest1thespeedupisequaltomedian+10ortomedian+0.1,itdoesnotchangethemedianvalue).Ontheotherhand,themedianinsensivitytoextremevaluesmakesitlessdependentonthetime-outs.Thegeometricalmeanseemstobeagoodsolution.Itdefinitelytakesintoaccounttheweightofthe“negativespeedup”(i.e.,aspeedupoflessthan1)3.Wefeltthesevaluesshouldbecomputedthesevaluesonlyonarelevantsubsetofthebenchmark.Forexample,wediscardedallthecaseswherethefourSATsolverstime-out.Inaddition,wediscardedcasesforwhichthefoursolvers(zChaI,Berkmin561,Siegev4,zChaII)runinlessthanveseconds.Therationalisthatitisdifficulttoaccuratelycompareverysmallruntimesandthattheseverysmallruntimesarearguablynotrelevant.TheresultsdisplayedinTable3(b)showthatBerkmin561isnearlytwotimesslowerthanzChaffI,butthatBerkminbehavessomehowbetterinthe“long”cases.2.2Siegev4Siegewashors-concoursfortheSAT03contest,thereforeitdidnotparticipateinthesecondstage.Nevertheless,theSiegeresultswereprettygoodforthe3Notethatthelogarithmofthegeometricalmeanisequaltothearithmeticalmeanofthelogarithms.
BenchmarkingSATSolversforBoundedModelChecking343Table3.Witha10000sec.time-outTotalTime#time-outs(inseconds)zChaI344,76425berkmin561414,03530Siegev419723914zChaII389,30431(a)SpeedupzBCerhkamffinIzCSiheagfefIzzCChhaaffffIIIgmeeodmiaenan00..455122..980711..2239global0.753.580.75(b)firststageandSiegehasareputationforbeingoneofthebestSATsolversavailable.WeranSiegev4[8]onthebenchmarkusing123456789asaseed.Table3displaystheoverallconclusions.FormoredetailsseeTable8and[13].Siegev4isthefastestin298cases(CNFs)outof498.Inmanydicultcases,Siegev4isfastestbyanorderofmagnitude,ormore.Insomecases,Siegev4performedsignicantlyworsethanzChaorBerkMin561.Forexample,for26rule,Siegev4isslowerthanzChaffbyanorderofmagnitudeandslowerthanBerkMinbytwoordersofmagnitude.Inaddition,withinthetime-out,severalCNFscanbesolvedonlybySiegev44.Inconclusion,weseethatSiegev4performssignificantlybetter(roughlyspeaking2.5timesfaster)thanzChaffI,Berkmin561andzChaffIIontheIBMCNF2002benchmark.2.3zChaIITheindustrialcategoryoftheSAT04competition[6]waswonbyzChaffII(zChaff2004.5.13).However“black-box”solverssuchasForklift(the2003edi-tionwinner)were“hors-concours”andassuchnotallowedtoenterthesecondstageofthecompetition.WefocushereontheperformanceofzChaffIIvszChaffI.zChaffIIisfasterthan:zChafffor229CNFs(outof442),Berkmin561for208CNFs(outof442),Siegev4for82CNFs(outof442).Inaddition,zChaIIreachedtime-outmoreoftenthanzChaffI(forsixmorecases).Table3dis-playsasyntheticviewoftheresults.ThezChaffIIcodewasleftunchanged,soitusedrandomseeds.Siegev4isroughlythreetimefasterthanzChaI.Figure1givesusagraphicviewoftheperformanceofzChaffIIvs.zChaffI.Theover-allperformancedifferencebetweenzChaffIandzChaffIIonourbenchmarkissmall5,eventhoughthetwosolverscanbehaveverydifferentlyinspecificcases.45Siegeisarandomizedsolver,therefore,itcouldbearguedthatthecomparisonisnotfairsincethezChaffandBerkMin561,versionsweusedwerenotrandomized.Nevertheless,evenadeterministicsolvercanbeluckyandprovidingSiegewithaseedofSiegemakesitdeterministic.ItshouldbenotethatzChaffIIisrandomized,sotheresultscouldbedifferentforotherruns.ItwouldbeinterestingtohaveastatisticaldescriptionoftherangeofzChaffIIperformances.
344E.Zarpas0000100010010110.11101001000100001.010.0zchaff I runtime08070605040302100niB(a)(b)Fig.1.Histogram(b)givesthedistributionofthedecimallogarithmofzChaffI/zChaffIIspeedupwith10000sectime-out2.4ComparisonwithSATContestResultsInordertounderstandwhetherornotourresultsareconsistentwiththoseoftheSAT03contest6(fordetailsresultssee[5]),wehadalookattheresultsofthefirstandsecondstagesofthecompetition.Firststage.Lookingattheresults7fortheindustrialcategory[5],wenotethefollowing:Series13rule8isover-represented(Cf.Table4).Besides,sinceallsolverstime-outonmostoftheCNFsfromthisseries,itisnotverymeaningful.Exceptforseries13ruleand11rule,theseriesfromtheIBMCNFbench-markareeasyforzCha,BerkMin561,andSiegev1.Resultsforrule07arenotconsistentwithourownexperiences.Thisdis-crepancyisprobablycausedbythe“Lisasyndrome”[4]:CNFswereshuffledforSAT03andsolverperformancecandifferdramaticallybetweenashuffledCNFandtheoriginal.TheSAT03contestresultsforBerkMin561,Forklift,Siegev1,andzChaon(shuffled)seriesfromtheIBMCNFBenchmarkaresummarizedinTable4.Ifresultsfromseries07ruleand13rulearediscarded,zChaffshowsbetterresultsthanBerkMin561.678InthefirststageoftheSAT04competition,zChaffIandzChaffIIhaveverycloseresultsandthesolversthatoutperformedthem(Forklift,Oepir)arenotavailableforexperimentation.TheSiegeversionusedfortheSAT03contestisSiegev1.ThenamesoftheSAT03seriesappearsinitalicsinordertodifferentiatethemfromtheseriesweusedinourexperiments.FortheexactcompositionoftheSAT03series,see[5].
BenchmarkingSATSolversforBoundedModelChecking345Table4.PartialresultsfromfirststageSAT03contestBerkMin561forkliftSiege1zChaffTotal#ofSolvedBenchmarks112112112101TotalCPUtimeneeded(sec)105000103000103000114000without13rule(sec)15105184084160without13ruleand07rule(sec)1410424268553WebelievethatthedifferencesinresultsforthefirststageofSAT03contestresultsforIBMthebenchmarkandourexperimentsareduetoclauseshufflinginSAT03andtothefactthattheSAT03experimentusedasmallertest-bedofCNFsfromtheIBMbenchmark.Secondstage.Duringthesecondstageofthecompetition,solverswererankedaccordingtothenumberofCNFstheycouldsolvefromarestrictedbenchmark.Forkliftwasrankedfirst,Berkmin561second,andzChaffIsixthontheindustrialbenchmark9.Clearly,therespectivezChaffIandBerkMin561rankingdonotcorrespondtoourexperimentalresults(seeTable3).However,inthesecondstage,allsolvers“timed-out”ontheIBMbenchmarksselected.Inotherwords,therankingofthesolversselectedforthesecondstageofthecompetitiondidnottakeintoaccountperformanceontheIBMbenchmarks.ThisprobablyexplainswhyourevaluationofzChaffandBerkMin561ontheIBMCNFBenchmarkgivesresultsthatarenotinlinewiththesecondstageresults(onindustrialbenchmarks)oftheSAT03contest.3ExperimentswithIBM2004Benchmark:SearchforCorrelation3.1DescriptionoftheIBM2004BenchmarkBoundedModelCheckingtranslationtechnologyiscontinuouslyevolving.There-fore,were-generatedtheIBMCNFBenchmarkfromtheIBMModelBenchmarkusinganalternativeBMCtool.Weranthenewtranslationforthefollowingboundsk=0..10,k=11..15,k=16..20,k=21..25,...,k=95..100.The2004CNFsareavailableon-line.Inthispaper,werefertothe2002and2004versionsastheoldandthenewbenchmarks,respectively.Table5and6presentstatisticaldescriptionsoftheoldandthenewbench-marks.Inthesetwotables,averageisthearithmeticalmean,STDEVisthestandarddeviation,clavuasresistheratiobetweenthenumberofclausesandthenumberofvariables,%unitisthepercentofunit(i.e.,oflengthone)clausesinaCNF,%binisthepercentofclausesoflengthtwo,%teristhepercentofclauses9Siegewashors-concours,thereforenotselectedforthesecondstage.
436E.ZarpasTable5.Old(2002)benchmarkvarclausesclavuarses%unit%bin%ter%l=4%l>4average80,167343,8264.120.375.514.23.56.5median54,857220,1804.080.277.012.23.56.6STDEV81,924388,1560.520.36.47.71.01.3max636,0893,172,1075.421.684.555.35.29.1min3,64514,6812.48040.84.30.10.1Table6.New(2004)Benchmarkvarclausesclavuarses%unit%bin%ter%l=4%l>4average73,414305,3013.990.674.515.03.66.3median50,897195,6123.980.476.113.13.66.6STDEV75,553349,2480.450.66.17.30.91.2max565,8892,760,5025.484.684.551.64.99.1min3,60614,1042.55046.54.40.10.1oflengththree,%l=4isthepercentofclausesoflength4and%l>4isthepercentofclauseslongerthan4.ThesetwotablestellusthatthenewbenchmarkCNFsareroughly10%smallerthantheoldbenchmarkCNFs.However,theyencompassroughlythesameproportionoflengthtwo,three,andfourclauses.Thehigherproportionofunitclausesinthenewbenchmarkisduetospecificoptimizations.Wealsoseethattheratioclavuarsesisslightlylowerinthenewbenchmark(seethenextsubsectionforadiscussionontherelevanceofthisratio).Figure2displaysthehistogramof%bin,wheretwomodels(07and131)haveanonstandardpercentofbinaryclausesandareinthe45%-55%andnotinthe70%-85%.WelookedatsomestatisticsforthestructureoftheCNFsofthenewbench-mark.Wefoundthatforanymodelofthebenchmark,therearefourrealnum-bersa,b,c,dsuchthatforalltheCNFsofthemodel,#(variables)ak+band#(clauses)ck+d,withkbeingtheboundusedtogeneratetheCNFs.Inamoregeneralway,foranymodelandforanyintegernstrictlygreaterthan0,therearetworealnumberseandfsuchthatforallCNFsofthemodel#(clausesoflengthn)ek+f(ofcourseeandfcanbenull,forexampleinmostmodels,thenumberofunitclausesisaconstant).Thecorrelations(foradiscussionaboutstatistictechniquesforNP-completeproblemssee[10])be-tweentheseriesfromthebenchmarkandtheseriespredictedwiththepreviousequationsareabout0.99.ThisisnotsurprisingsincetheCNFsaregeneratedfromthemodelinawayessentiallylineartothebound.Inaddition,wefoundoutthat,inourbenchmark,thelengthofthelongestclauseisthesameforallCNFsgeneratedfromthesamemodel.Figure2(b)displaysthehistogramofthelongestclausesinthenewbenchmark.Notethattwomodels,07and131,whichhavethegreaterlongestclauses,arealsotheoneswitha“nonstandard”percentofbinaryclauses.
Voir icon more
Alternate Text