Benchmarking SAT Solvers for Bounded Model Checking

pages

English

Documents

Lire

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

pages

English

Documents

Lire

Le téléchargement nécessite un accès à la bibliothèque YouScribe Tout savoir sur nos offres

Publié par

Thean

Nombre de lectures

Langue

English

Benchmarking SAT Solvers for Bounded
Model Checking
Emmanuel Zarpas
IBM Haifa Research Laboratory,
zarpas@il.ibm.com
Abstract. Modern SAT solvers are highly dependent on heuristics.
Therefore, benchmarking is of prime importance in evaluating the per-
formances of diﬀerent solvers. However, relevant benchmarking is not
necessarily straightforward. We present our experiments using the IBM
CNF Benchmark on several SAT solvers. Using the results, we attempt
to deﬁne guidelines for a relevant benchmarking methodology, using SAT
solvers for real life BMC applications.
1 Introduction
Over the past decade, formal veriﬁcation via model checking has evolved from a
theoretical concept to a production-level technique. It is being actively used in
chip design projects across the industry, where formal veriﬁcation engineers can
now tackle the veriﬁcation of large industrial hardware designs. Bounded Model
Checking (BMC) [1] has recently enabled the veriﬁcation of problems that used
to be beyond the reach of formal veriﬁcation. However, BMC performance relies
heavily on the performance of the underlying SAT solver.
We conducted several experiments using the IBM CNF benchmark [12, 11]
to evaluate SAT tools. Often, we found discrepancies between our results and
the experimental results found in the literature. For example, our results are not
necessarily in line with the results of the SAT03 contest [4]. The main goal of this
paper is to outline a methodology for benchmarking SAT ...

Voir

Publié par

Thean

Nombre de lectures

Langue

English

BenchmarkingSATSolversforBoundedModelCheckingEmmanuelZarpasIBMHaifaResearchLaboratory,azprsai@.lbi.mocmAbstract.ModernSATsolversarehighlydependentonheuristics.Therefore,benchmarkingisofprimeimportanceinevaluatingtheper-formancesofdiﬀerentsolvers.However,relevantbenchmarkingisnotnecessarilystraightforward.WepresentourexperimentsusingtheIBMCNFBenchmarkonseveralSATsolvers.Usingtheresults,weattempttodeﬁneguidelinesforarelevantbenchmarkingmethodology,usingSATsolversforreallifeBMCapplications.1IntroductionOverthepastdecade,formalveriﬁcationviamodelcheckinghasevolvedfromatheoreticalconcepttoaproduction-leveltechnique.Itisbeingactivelyusedinchipdesignprojectsacrosstheindustry,whereformalveriﬁcationengineerscannowtackletheveriﬁcationoflargeindustrialhardwaredesigns.BoundedModelChecking(BMC)[1]hasrecentlyenabledtheveriﬁcationofproblemsthatusedtobebeyondthereachofformalveriﬁcation.However,BMCperformancereliesheavilyontheperformanceoftheunderlyingSATsolver.WeconductedseveralexperimentsusingtheIBMCNFbenchmark[12,11]toevaluateSATtools.Often,wefounddiscrepanciesbetweenourresultsandtheexperimentalresultsfoundintheliterature.Forexample,ourresultsarenotnecessarilyinlinewiththeresultsoftheSAT03contest[4].ThemaingoalofthispaperistooutlineamethodologyforbenchmarkingSATsolversonBoundedModelCheckingproblems.Thepaperisorganizedasfollows:InSection2,wepresentexperimentalresultsforsomefamousSATsolvers.Section3presentsanewversionoftheIBMBenchmarkanddiscussescorrelationoftheCNFcharacteristicsandsolversperformance.Section4containsadiscussiononSATbenchmarkingforBMC,wherewetrytolearnfromourexperimentalresultsinlightofotherexperimentssuchastheSATcontests.Section5concludesthepaper.2ExperimentswithIBM2002Benchmark:SATSolverComparison[O7]u,rzeCxhpaerﬀiImIe(nztCshcaoﬀm2p0a0re4.d5.f1o3u)r[f3a],mBoeurskSMAiTn5s6o1lv[e2r]sa:nzdChSiaeﬀgeI(v2400[81]..2.W17evuesresdiotnh)eF.BacchusandT.Walsh(Eds.):SAT2005,LNCS3569,pp.340–354,2005.cSpringer-VerlagBerlinHeidelberg2005

BenchmarkingSATSolversforBoundedModelChecking3412002versionoftheIBMCNFBenchmark[12],whoseCNFsaregeneratedthroughBMCfromtheIBMFormalVeriﬁcationBenchmarkLibrary[11].Thislibraryisacollectionofmodelsfromreal-lifeindustrialhardwareveriﬁcationprojects.Foreachmodel,weusedtheCNFformulas(SATdat.k.cnf)forboundwithk=1,10,15,20,25,30,35,40,45,50.Thetime-outwassetat10,000secondsonaworkstationwith867841XIntel(R)Xeon(TM)CPU2.40GHz,with512KBﬁrstlevelcacheand2.5GBphysicalmemory.Weusedthedefaultconﬁgurationforeachengine.2.1zChaﬀIvs.BerkMin561WeranzChaﬀI(2001.21.17version)andBerkMin561onthe2002versionoftheIBMCNFBenchmark.Table1isasummaryoftheresultsobtained.MoreTable1.Theresultsaredisplayedinsecondsandincludethetimeout10,000secondszChaﬀBerkMin561Totaltime(10000sectime-out)344765414094First(#ofCNFwheretheengineisthefastest)298131#ofunsolvedproblems(timeoutreached)2530+(#ofCNFwheretheengineisthefastestbymorethanaminuteand20%)6732Firstbymodel(#ofmodelwheretheengineisthefastest)2818completeexperimentalresultsarepresentedinTable8,whereforeachmodelinthetable,wepresentthesumoftheresultsofSATdat.k.cnfwithk=1,10,15,20,25,30,35,40,45,50(i.e.,theBMCtranslationofeachmodelforthediﬀerentk).Formoredetails,seethecompleteresultsin[13].BecauseSATsolversdonotalwaysbehaveinhomogeneousmanner(Cf.detailedresultsin[13]ortable2),thecompleteresultsanalysisshouldnotbedisregarded.TheresultsshowthatzChaﬀandBerkMin561achievedcloseresults.OnsomeCNFs,zChaﬀrunsfaster,andonothers,BerkMin561runsfaster.Inmostcases,thediﬀerencesintheirperformanceisnotverysigniﬁcant,thehighestspeedisnotfasterbymorethanoneminuteor20%.However,whilezChaﬀseemstoperformslightlybetteroverall,BerkMin561getsbetterresultsthanzChaﬀontheUNSATCNFs.Fromtheseresults,itisnotpossibletoconcludethatBerkMin561performsbetterthanzChaﬀ.ThisisnotconsistentwithwhatcanbereadintheliteratureorwiththeﬁnalresultsoftheSAT03contest(Cf.section2.4).ThisisnotaverysatisfyingconclusionandthepreviousmetricsdonottellusinasimplewayhowmuchfasterzChaﬀisthanBerkmin5612.Weneedsome12BerkMin561wassecondintheSAT03contestindustrialbenchmarkscategoryintheSAT03contest;thewinnerForkliftisnotpubliclyavailable.Itisveryimportanttobeabletogiveclearandconcisevaluesforresultsdissemi-nation.

342E.ZarpasTable2.TheresultsaredisplayedinsecondsfortheCNFsfrom18rulemodel.Thetime-outwassetto10000seconds.zChaﬀperformstheworstfork=15,20,35;Berk-Min561performstheworstfork=30,40,50;Siegev4hasthebestperformance,exceptfork=15,2518ruleCNFResultzChaﬀBerkMinSiegeSATdat.k1.cnfunsat000SATdat.k10.cnfunsat110SATdat.k15.cnfunsat1345SATdat.k20.cnfunsat654741SATdat.k25.cnfunsat951109251SATdat.k30.cnfsat700755557SATdat.k35.cnfsattime-out22301730SATdat.k40.cnfsat55108060247SATdat.k45.cnfsattime-outtime-out959SATdat.k50.cnfsat5010time-out1370additionalmetricsinordertocomparetheseSATsolvers.Thearithmeticmeanofthespeedupseemsirrelevant.Let’ssaywewanttocomparetheperformancesofsolversAandB.IfthespeedupofAvs.Bis10ontest1and0.0001ontest2,thearithmeticmeanofthespeedup(Avs.B)wouldbe5andthemeanofthespeedup(Bvs.A)wouldbe5000.Thisisclearlynotrelevant.Theglobalspeedup(TotaltimeA/TotalTimeB)isgoodtohavebutthe“longruns”havefarmoreweightthantheshortones.Themedianismuchmoreinteresting.howeveritdoesnottakeintoaccounttheextremevalues(e.g,ifontest1thespeedupisequaltomedian+10ortomedian+0.1,itdoesnotchangethemedianvalue).Ontheotherhand,themedianinsensivitytoextremevaluesmakesitlessdependentonthetime-outs.Thegeometricalmeanseemstobeagoodsolution.Itdeﬁnitelytakesintoaccounttheweightofthe“negativespeedup”(i.e.,aspeedupoflessthan1)3.Wefeltthesevaluesshouldbecomputedthesevaluesonlyonarelevantsubsetofthebenchmark.Forexample,wediscardedallthecaseswherethefourSATsolverstime-out.Inaddition,wediscardedcasesforwhichthefoursolvers(zChaﬀI,Berkmin561,Siegev4,zChaﬀII)runinlessthanﬁveseconds.Therationalisthatitisdiﬃculttoaccuratelycompareverysmallruntimesandthattheseverysmallruntimesarearguablynotrelevant.TheresultsdisplayedinTable3(b)showthatBerkmin561isnearlytwotimesslowerthanzChaﬀI,butthatBerkminbehavessomehowbetterinthe“long”cases.2.2Siegev4Siegewashors-concoursfortheSAT03contest,thereforeitdidnotparticipateinthesecondstage.Nevertheless,theSiegeresultswereprettygoodforthe3Notethatthelogarithmofthegeometricalmeanisequaltothearithmeticalmeanofthelogarithms.

BenchmarkingSATSolversforBoundedModelChecking343Table3.Witha10000sec.time-outTotalTime#time-outs(inseconds)zChaﬀI344,76425berkmin561414,03530Siegev419723914zChaﬀII389,30431(a)SpeedupzBCerhkamffinIzCSiheagfefIzzCChhaaffffIIIgmeeodmiaenan00..455122..980711..2239global0.753.580.75(b)ﬁrststageandSiegehasareputationforbeingoneofthebestSATsolversavailable.WeranSiegev4[8]onthebenchmarkusing123456789asaseed.Table3displaystheoverallconclusions.FormoredetailsseeTable8and[13].Siegev4isthefastestin298cases(CNFs)outof498.Inmanydiﬃcultcases,Siegev4isfastestbyanorderofmagnitude,ormore.Insomecases,Siegev4performedsigniﬁcantlyworsethanzChaﬀorBerkMin561.Forexample,for26rule,Siegev4isslowerthanzChaﬀbyanorderofmagnitudeandslowerthanBerkMinbytwoordersofmagnitude.Inaddition,withinthetime-out,severalCNFscanbesolvedonlybySiegev44.Inconclusion,weseethatSiegev4performssigniﬁcantlybetter(roughlyspeaking2.5timesfaster)thanzChaﬀI,Berkmin561andzChaﬀIIontheIBMCNF2002benchmark.2.3zChaﬀIITheindustrialcategoryoftheSAT04competition[6]waswonbyzChaﬀII(zChaﬀ2004.5.13).However“black-box”solverssuchasForklift(the2003edi-tionwinner)were“hors-concours”andassuchnotallowedtoenterthesecondstageofthecompetition.WefocushereontheperformanceofzChaﬀIIvszChaﬀI.zChaﬀIIisfasterthan:zChaﬀfor229CNFs(outof442),Berkmin561for208CNFs(outof442),Siegev4for82CNFs(outof442).Inaddition,zChaﬀIIreachedtime-outmoreoftenthanzChaﬀI(forsixmorecases).Table3dis-playsasyntheticviewoftheresults.ThezChaﬀIIcodewasleftunchanged,soitusedrandomseeds.Siegev4isroughlythreetimefasterthanzChaﬀI.Figure1givesusagraphicviewoftheperformanceofzChaﬀIIvs.zChaﬀI.Theover-allperformancediﬀerencebetweenzChaﬀIandzChaﬀIIonourbenchmarkissmall5,eventhoughthetwosolverscanbehaveverydiﬀerentlyinspeciﬁccases.45Siegeisarandomizedsolver,therefore,itcouldbearguedthatthecomparisonisnotfairsincethezChaﬀandBerkMin561,versionsweusedwerenotrandomized.Nevertheless,evenadeterministicsolvercanbeluckyandprovidingSiegewithaseedofSiegemakesitdeterministic.ItshouldbenotethatzChaﬀIIisrandomized,sotheresultscouldbediﬀerentforotherruns.ItwouldbeinterestingtohaveastatisticaldescriptionoftherangeofzChaﬀIIperformances.

344E.Zarpas0000100010010110.11101001000100001.010.0zchaff I runtime08070605040302100niB(a)(b)Fig.1.Histogram(b)givesthedistributionofthedecimallogarithmofzChaﬀI/zChaﬀIIspeedupwith10000sectime-out2.4ComparisonwithSATContestResultsInordertounderstandwhetherornotourresultsareconsistentwiththoseoftheSAT03contest6(fordetailsresultssee[5]),wehadalookattheresultsoftheﬁrstandsecondstagesofthecompetition.Firststage.Lookingattheresults7fortheindustrialcategory[5],wenotethefollowing:–Series13rule8isover-represented(Cf.Table4).Besides,sinceallsolverstime-outonmostoftheCNFsfromthisseries,itisnotverymeaningful.–Exceptforseries13ruleand11rule,theseriesfromtheIBMCNFbench-markareeasyforzChaﬀ,BerkMin561,andSiegev1.–Resultsforrule07arenotconsistentwithourownexperiences.Thisdis-crepancyisprobablycausedbythe“Lisasyndrome”[4]:CNFswereshuﬄedforSAT03andsolverperformancecandiﬀerdramaticallybetweenashuﬄedCNFandtheoriginal.TheSAT03contestresultsforBerkMin561,Forklift,Siegev1,andzChaﬀon(shuﬄed)seriesfromtheIBMCNFBenchmarkaresummarizedinTable4.Ifresultsfromseries07ruleand13rulearediscarded,zChaﬀshowsbetterresultsthanBerkMin561.678IntheﬁrststageoftheSAT04competition,zChaﬀIandzChaﬀIIhaveverycloseresultsandthesolversthatoutperformedthem(Forklift,Oepir)arenotavailableforexperimentation.TheSiegeversionusedfortheSAT03contestisSiegev1.ThenamesoftheSAT03seriesappearsinitalicsinordertodiﬀerentiatethemfromtheseriesweusedinourexperiments.FortheexactcompositionoftheSAT03series,see[5].

BenchmarkingSATSolversforBoundedModelChecking345Table4.PartialresultsfromﬁrststageSAT03contestBerkMin561forkliftSiege1zChaﬀTotal#ofSolvedBenchmarks112112112101TotalCPUtimeneeded(sec)105000103000103000114000without13rule(sec)15105184084160without13ruleand07rule(sec)1410424268553WebelievethatthediﬀerencesinresultsfortheﬁrststageofSAT03contestresultsforIBMthebenchmarkandourexperimentsareduetoclauseshuﬄinginSAT03andtothefactthattheSAT03experimentusedasmallertest-bedofCNFsfromtheIBMbenchmark.Secondstage.Duringthesecondstageofthecompetition,solverswererankedaccordingtothenumberofCNFstheycouldsolvefromarestrictedbenchmark.Forkliftwasrankedﬁrst,Berkmin561second,andzChaﬀIsixthontheindustrialbenchmark9.Clearly,therespectivezChaﬀIandBerkMin561rankingdonotcorrespondtoourexperimentalresults(seeTable3).However,inthesecondstage,allsolvers“timed-out”ontheIBMbenchmarksselected.Inotherwords,therankingofthesolversselectedforthesecondstageofthecompetitiondidnottakeintoaccountperformanceontheIBMbenchmarks.ThisprobablyexplainswhyourevaluationofzChaﬀandBerkMin561ontheIBMCNFBenchmarkgivesresultsthatarenotinlinewiththesecondstageresults(onindustrialbenchmarks)oftheSAT03contest.3ExperimentswithIBM2004Benchmark:SearchforCorrelation3.1DescriptionoftheIBM2004BenchmarkBoundedModelCheckingtranslationtechnologyiscontinuouslyevolving.There-fore,were-generatedtheIBMCNFBenchmarkfromtheIBMModelBenchmarkusinganalternativeBMCtool.Weranthenewtranslationforthefollowingboundsk=0..10,k=11..15,k=16..20,k=21..25,...,k=95..100.The2004CNFsareavailableon-line.Inthispaper,werefertothe2002and2004versionsastheoldandthenewbenchmarks,respectively.Table5and6presentstatisticaldescriptionsoftheoldandthenewbench-marks.Inthesetwotables,averageisthearithmeticalmean,STDEVisthestandarddeviation,clavuasresistheratiobetweenthenumberofclausesandthenumberofvariables,%unitisthepercentofunit(i.e.,oflengthone)clausesinaCNF,%binisthepercentofclausesoflengthtwo,%teristhepercentofclauses9Siegewashors-concours,thereforenotselectedforthesecondstage.

436E.ZarpasTable5.Old(2002)benchmarkvarclausesclavuarses%unit%bin%ter%l=4%l>4average80,167343,8264.120.375.514.23.56.5median54,857220,1804.080.277.012.23.56.6STDEV81,924388,1560.520.36.47.71.01.3max636,0893,172,1075.421.684.555.35.29.1min3,64514,6812.48040.84.30.10.1Table6.New(2004)Benchmarkvarclausesclavuarses%unit%bin%ter%l=4%l>4average73,414305,3013.990.674.515.03.66.3median50,897195,6123.980.476.113.13.66.6STDEV75,553349,2480.450.66.17.30.91.2max565,8892,760,5025.484.684.551.64.99.1min3,60614,1042.55046.54.40.10.1oflengththree,%l=4isthepercentofclausesoflength4and%l>4isthepercentofclauseslongerthan4.ThesetwotablestellusthatthenewbenchmarkCNFsareroughly10%smallerthantheoldbenchmarkCNFs.However,theyencompassroughlythesameproportionoflengthtwo,three,andfourclauses.Thehigherproportionofunitclausesinthenewbenchmarkisduetospeciﬁcoptimizations.Wealsoseethattheratioclavuarsesisslightlylowerinthenewbenchmark(seethenextsubsectionforadiscussionontherelevanceofthisratio).Figure2displaysthehistogramof%bin,wheretwomodels(07and131)havea“nonstandard”percentofbinaryclausesandareinthe45%-55%andnotinthe70%-85%.WelookedatsomestatisticsforthestructureoftheCNFsofthenewbench-mark.Wefoundthatforanymodelofthebenchmark,therearefourrealnum-bersa,b,c,dsuchthatforalltheCNFsofthemodel,#(variables)≈ak+band#(clauses)≈ck+d,withkbeingtheboundusedtogeneratetheCNFs.Inamoregeneralway,foranymodelandforanyintegernstrictlygreaterthan0,therearetworealnumberseandfsuchthatforallCNFsofthemodel#(clausesoflengthn)≈ek+f(ofcourseeandfcanbenull,forexampleinmostmodels,thenumberofunitclausesisaconstant).Thecorrelations(foradiscussionaboutstatistictechniquesforNP-completeproblemssee[10])be-tweentheseriesfromthebenchmarkandtheseriespredictedwiththepreviousequationsareabout0.99.ThisisnotsurprisingsincetheCNFsaregeneratedfromthemodelinawayessentiallylineartothebound.Inaddition,wefoundoutthat,inourbenchmark,thelengthofthelongestclauseisthesameforallCNFsgeneratedfromthesamemodel.Figure2(b)displaysthehistogramofthelongestclausesinthenewbenchmark.Notethattwomodels,07and131,whichhavethegreaterlongestclauses,arealsotheoneswitha“nonstandard”percentofbinaryclauses.

Voir