Lin-Dyer-tutorial-MapReduce


Data-Intensive Text Processing with MapReduce
Jimmy Lin and Chris Dyer
University of Maryland, College Park
{jimmylin,redpony}@umd.edu
Overview

This half-day tutorial introduces participants to data-intensive text processing with the MapReduce programming model [1], using the open-source Hadoop implementation. The focus will be on scalability and the tradeoffs associated with distributed processing of large datasets. Content will include general discussions about algorithm design, presentation of illustrative algorithms, case studies in human language technology (HLT) applications, as well as practical advice on writing Hadoop programs and running Hadoop clusters.

Amazon has generously agreed to provide each participant with $100 in Amazon Web Services (AWS) credits that can be used toward its Elastic Compute Cloud (EC2) “utility computing” service (sufficient for 1000 instance-hours). EC2 allows anyone to rapidly provision Hadoop clusters “on the fly” without upfront hardware investments, and provides a low-cost vehicle for exploring Hadoop.

Intended Audience

The tutorial is targeted at any NLP researcher interested in data-intensive processing and scalability issues in general. No background in parallel or distributed computing is necessary, but prior knowledge of HLT is assumed.

Course Objectives

• Acquire understanding of the MapReduce programming model and how it relates to alternative approaches to concurrent programming.
• Acquire understanding of how data-intensive HLT problems (e.g., text retrieval, iterative optimization problems, etc.) can be solved using MapReduce.
• Acquire understanding of the tradeoffs involved in designing MapReduce algorithms and awareness of associated engineering issues.

Tutorial Topics

The following lists topics that will be covered:

• MapReduce algorithm design
• Distributed counting applications (e.g., relative frequency estimation)
• Applications to text retrieval
• Applications to graph algorithms
• Applications to iterative optimization algorithms (e.g., EM)
• Practical Hadoop issues
• Limitations of MapReduce
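
To make the distributed counting topic concrete, here is a minimal in-memory sketch of relative frequency estimation written in the map/shuffle/reduce style. It is an illustration only, not the tutorial's own code and not Hadoop code: the function names (map_bigrams, shuffle, reduce_relative_freq, relative_frequencies), the choice of adjacent word pairs as the events being counted, and the idea of emitting one small count table per pair are all assumptions made for this example. On a real cluster the framework would perform the grouping step across machines.

```python
from collections import Counter, defaultdict


def map_bigrams(line):
    """Map: for each adjacent word pair, emit (left_word, {right_word: 1})."""
    words = line.lower().split()
    for left, right in zip(words, words[1:]):
        yield left, Counter({right: 1})


def shuffle(pairs):
    """Group mapper output by key (done by the framework on a real cluster)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_relative_freq(left, partial_counts):
    """Reduce: sum the partial counts for one left word, then normalize."""
    totals = Counter()
    for counts in partial_counts:
        totals.update(counts)  # Counter.update adds counts together
    marginal = sum(totals.values())
    return left, {right: n / marginal for right, n in totals.items()}


def relative_frequencies(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of text lines."""
    mapped = (pair for line in lines for pair in map_bigrams(line))
    grouped = shuffle(mapped)
    return dict(reduce_relative_freq(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    # Hypothetical toy corpus: "a" is followed by "b" twice and by "c" once.
    print(relative_frequencies(["a b a c", "a b"]))
```

Because each reducer receives every count for a given left word, it can compute the marginal and the normalized frequencies in a single pass; that co-location of related counts is the design decision this sketch is meant to show.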

Instructor Bios

Jimmy Lin is an assistant professor in the iSchool at the University of Maryland, College Park. He joined the faculty in 2004 after completing his Ph.D. in Electrical Engineering and Computer Science at MIT. Dr. Lin’s research interests lie at the intersection of natural language processing and information retrieval. He leads the University of Maryland’s effort in the Google/IBM Academic Cloud Computing Initiative. Dr. Lin has taught two semester-long Hadoop courses [2] and has given numerous talks about MapReduce to a wide audience.

Chris Dyer is a Ph.D. student at the University of Maryland, College Park, in the Department of Linguistics. His current research interests include statistical machine translation, machine learning, and the relationship between artificial language processing systems and the human linguistic processing system. He has served on program committees for AMTA, ACL, COLING, EACL, EMNLP, NAACL, IWSLT, and the ACL Workshops on Machine Translation, and is one of the developers of the Moses open-source machine translation toolkit. He has practical experience solving NLP problems with both the Hadoop MapReduce framework and Google’s MapReduce implementation, which was made possible by an internship with Google Research in 2008.

Acknowledgments

This work is supported by NSF under awards IIS-0705832 and IIS-0836560; the Intramural Research Program of the NIH, National Library of Medicine; and DARPA/IPTO Contract No. HR0011-06-2-0001 under the GALE program. Any opinions, findings, conclusions, or recommendations expressed here are the instructors’ and do not necessarily reflect those of the sponsors. We are grateful to Amazon for its support of tutorial participants.

References

[1] Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pp. 137-150, 2004, San Francisco, California.
[2] Jimmy Lin. Exploring Large-Data Issues in the Curriculum: A Case Study with MapReduce. Proceedings of the Third Workshop on Issues in Teaching Computational Linguistics (TeachCL-08) at ACL 2008, pp. 54-61, 2008, Columbus, Ohio.