PIP Tutorial

John Conery

April 26, 2005

The pipeline interface program (PIP) is a workflow system for managing complex projects. Although it was designed primarily for bioinformatics projects, the system should work well for any project that manages a large number of applications that are invoked by Unix command lines.

PIP uses a rule-based workflow paradigm. Each individual step in the workflow is defined by a rule that tells the system what inputs are required by the step, the application(s) to run to execute the step, and the outputs produced by the step. The initial inputs, intermediate work products, and final outputs are all stored in a database, and PIP uses timestamps on database tables to automatically schedule the workflow steps.

This tutorial will explain how to create a PIP workflow by going through the steps in the development of a project that downloads yeast chromosomes from NCBI and searches for pairs of genes that may be recent tandem duplicates. The first section is a project overview. Remaining sections show how to set up the database connection and initial PIP file, and then how to add rules to the workflow in order to implement each step.

To do all the steps in the tutorial you will need to have Perl installed on your workstation, along with the Bio::Perl library and CPAN modules for accessing a MySQL database and downloading files via FTP (see the Software Environment section for details). It is possible to do the tutorial if you do not have all the necessary Perl modules; at the end of the description of each step there are instructions for how to bypass that step and load data directly into your database so you can continue with later steps.

This document is a PDF file with embedded hyperlinks. If you are reading it on-line with Adobe Acrobat Reader or the Mac OS/X Preview application, hyperlinks will appear in red letters. You can use these links to connect to web pages (e.g. for the Perl libraries described in the Software Environment section) or download files (e.g. scripts mentioned throughout the tutorial). Blue-colored text corresponds to hyperlinks within this document, e.g. you can click on a section name in the table of contents to go directly to that section.

Contents

1 Project Overview: Searching for Tandem Duplicates
    1.1 Tandem Duplications
    1.2 Workflow

2 Software Environment
    2.1 Perl
    2.2 MySQL
    2.3 CPAN Modules

3 Database Connections
    3.1 MySQL
    3.2 Project Database
    3.3 Accessing the Database from the Command Line
    3.4 Configuration Files

4 A Template for PIP Projects
    4.1 Project Home
    4.2 Directory Structure
    4.3 Template

5 Structure of a Pipfile
    5.1 Pipfile Organization
    5.2 Standard Variable Definitions

6 Running PIP
    6.1 PIP Command
    6.2 Command Line Options

7 Project Status

8 Make a List of Yeast Chromosomes
    8.1 startup Script
    8.2 source Table
    8.3 PIP Rule for source
    8.4 Top Level Rules

9 Download the Chromosomes from NCBI
    9.1 Query
    9.2 Script
    9.3 Table
    9.4 Rule
    9.5 Test the PIP Rule

10 Make Chromosome Descriptions
    10.1 Feature Parser
    10.2 Wrapper
    10.3 Table
    10.4 Rule

11 Extract Gene Sequences
    11.1 Script
    11.2 Table
    11.3 Rule

12 Use BLAST to Find Duplicates
    12.1 Specification
    12.2 Preparation
    12.3 Stage Class
    12.4 Blast Class
    12.5 Putting it All Together

13 Reciprocal Best Hits
    13.1 Blast Results
    13.2 Reciprocal Hits
    13.3 Rule

14 Tandem Duplicates

15 Wrapping Up

1 Project Overview: Searching for Tandem Duplicates

1.1 Tandem Duplications

A tandem duplication is a mutation that occurs during meiosis. When the parent cell’s chromosomes separate, they often break and recombine so that chromosomes in daughter cells have genes from two different parent chromosomes. If chromosomes break at different points, one daughter cell may end up with two copies of the same gene and the other daughter cell will be lacking the gene.

If the mutation for the extra gene is not lethal, the individuals that carry it may pass the two copies on to successive generations. Eventually, either from genetic drift or because the double copy provides a selective benefit, the new gene may be fixed in the population and become a permanent part of the organism’s genome.

In this project we will look for possible instances of recent tandem duplications in brewer’s yeast, Saccharomyces cerevisiae. We will take a very simple-minded approach and look for pairs of genes that are highly similar and located very close to each other on the same chromosome.

This project is a nice exercise in building a rule-based pipeline, but isn’t a very realistic example of a true search for tandem duplicates. The fact that two genes are similar and close to each other doesn’t mean they are tandem duplicates (we won’t add any criteria to say how similar or how close they must be) and we may not find all recent tandem pairs (one may have been relocated soon after duplication).

1.2 Workflow

The steps in the project workflow are:

1. Build a table with a list of yeast chromosomes and web addresses of the FTP server at NCBI from which we can download annotated genome files for the chromosome.
2. Download the genome files. We will be fetching Genbank reports, which are complete descriptions of the chromosomes, including all the genes.
3. Scan the reports to get chromosome information, including chromosome names and sizes.
4. Run a “feature parser” on each report to pull out the complete set of genes from each chromosome.
5. Create a BLAST database from the full set of genes, and then do an all-vs-all BLAST search where we compare each yeast gene against all the others.
6. A common technique for identifying two genes that are most closely related to each other is to look for “reciprocal best hits”, that is, genes X and Y such that Y is the best BLAST hit for X and vice versa.
7. Filter the set of reciprocal best hits to find pairs that are on the same chromosome and within 10K bases of each other (regardless of whether there are any other genes in between).

Steps 2 through 4 do a bit more work than is necessary, since we could fetch sets of genes from NCBI instead of downloading full chromosome sequences and then scanning the chromosome files to pick out the genes. We chose the full-chromosome approach for several reasons:

• The feature parser uses a BioPerl library routine to do the parsing. BioPerl is a very useful library, and we wanted to show how to use it in a PIP pipeline.
• The BioPerl routine works on any format, not just the NCBI Genbank Report format, so this version of the pipeline will be able to use data from other sources.
• When we run the feature parser we have more control over the type of information printed for each gene; if we download the gene files we have to rely on parsing the deflines in the FASTA files to get gene names, locations, etc.
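The logic of steps 6 and 7 can be illustrated with a short sketch. It is written in Python for brevity and is not the tutorial's own code (the tutorial's scripts are in Perl). It assumes the BLAST results have already been reduced to (query, subject, score) tuples and that each gene has a known chromosome and start position; the gene names, scores, locations, and the function names best_hits, reciprocal_best_hits, and tandem_candidates are all hypothetical. Distance is measured between start positions, which is one possible reading of "within 10K bases".

```python
from typing import Dict, Iterable, List, Set, Tuple

Hit = Tuple[str, str, float]   # (query gene, subject gene, score)
Location = Tuple[str, int]     # (chromosome name, start position)


def best_hits(hits: Iterable[Hit]) -> Dict[str, str]:
    """Map each query gene to its highest-scoring hit, ignoring self-hits."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query == subject:
            continue  # every gene matches itself in an all-vs-all search
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits: Iterable[Hit]) -> Set[Tuple[str, str]]:
    """Return pairs (x, y) with x < y where y is x's best hit and vice versa."""
    best = best_hits(hits)
    return {(x, y) for x, y in best.items() if x < y and best.get(y) == x}


def tandem_candidates(
    pairs: Set[Tuple[str, str]],
    locations: Dict[str, Location],
    max_distance: int = 10_000,
) -> List[Tuple[str, str]]:
    """Keep pairs on the same chromosome whose starts are within max_distance."""
    result = []
    for x, y in sorted(pairs):
        chrom_x, start_x = locations[x]
        chrom_y, start_y = locations[y]
        if chrom_x == chrom_y and abs(start_x - start_y) <= max_distance:
            result.append((x, y))
    return result


# Hypothetical example data, not real yeast genes or BLAST scores.
hits = [
    ("geneA", "geneA", 500.0),
    ("geneA", "geneB", 300.0),
    ("geneA", "geneC", 100.0),
    ("geneB", "geneA", 290.0),
    ("geneC", "geneA", 100.0),
    ("geneC", "geneD", 250.0),
    ("geneD", "geneC", 240.0),
]
locations = {
    "geneA": ("chrI", 1_000),
    "geneB": ("chrI", 6_000),
    "geneC": ("chrII", 2_000),
    "geneD": ("chrIV", 2_500),
}

pairs = reciprocal_best_hits(hits)
candidates = tandem_candidates(pairs, locations)
print(candidates)
```

In this example both geneA/geneB and geneC/geneD are reciprocal best hits, but only geneA/geneB survives the filter, because geneC and geneD lie on different chromosomes.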

2 Software Environment

2.1 Perl

PIP is written in Perl. It was developed and tested with Perl 5.8, but will probably run under 5.6. PIP requires the following CPAN modules to be installed on the system where it runs:

• Getopt::Long
• File::stat
• FileHandle
• DBI
• DBD::mysql

2.2 MySQL

PIP was developed and tested on MySQL 4.0, but earlier versions that support a SHOW TABLE STATUS command should also work. PIP uses the DBI interface for only a few operations (mainly checking timestamps on tables); most other operations are done via system calls that invoke the mysql command line tool.

2.3 CPAN Modules

The scripts used in the tutorial use the following CPAN modules:

    Workflow Step    Modules Used
    download         Net::FTP
    fp               Bio::SeqIO (see also bioperl.org)

If you do not have these modules installed you can still run the tutorial: for each step that needs them, you can download a table description and table data using links provided in the corresponding section of the tutorial, create the table, and load the data. As long as PIP sees that the table exists it can run later steps in the workflow.

3 Database Connections

3.1 MySQL

PIP currently requires projects to use a MySQL database. Future versions may allow connections to other types of relational databases, but for now you need to have access to a MySQL database.

MySQL uses a client-server software architecture. In this type of system, the server is a program that runs continually on a system and accepts requests from clients, which may or may not be on the same system.

When you create a project database, you can access it as you would any other MySQL database, using the mysql command line interface or a client such as CocoaMySQL on Mac OS/X or any of a wide variety of MySQL client programs. As far as the database server is concerned, PIP is just another client. What that means for your projects is you can use PIP to run steps in the workflow, but at any time you can access the server to check on the status of the project, or to examine outputs produced by steps that have completed.

3.2 Project Database

The first step in starting up a new project to be managed by PIP is to decide where the project database will reside. You need to select a server, which can either be on your own workstation or laptop, or on any remote system your machine can connect to. You need to have a MySQL user account on that server. If you work in a lab or research group with a shared server you may not be able to create your own new database, in which case you need to ask the MySQL administrator to create one. Once the database is created, your access permissions need to be set so you can create new tables in that database.

Exercise: Throughout the tutorial, we’ll use the example of a MySQL server running on teleost.cs.uoregon.edu. The example database will be named yeast, and the user name will be conery. You should of course replace these names with the names of your server, database, and account when you type the commands.

3.3 Accessing the Database from the Command Line

To verify you can connect to the project database, use mysql, the Unix program that implements a command line interface to a MySQL database. You can run this program on a laptop or workstation by using the -h option to tell it to connect to a remote server. Specify your MySQL user name with the -u option, and add -p so that mysql prompts for your password.

Exercise: Open a connection to the server that will host your tandem duplicate database.

    % mysql -h teleost.cs.uoregon.edu -u conery -p
    Enter password: .....
    mysql>

When you see the mysql> prompt it means you have connected successfully and you can now enter queries. The first thing is to list the databases and see if your database has been created; you should see something like this:

    mysql> SHOW DATABASES;
    +----------+
    | Database |
    +----------+
    | mysql    |
    | test     |
    | trna     |
    | yeast    |
    +----------+

Now make sure you can use your database:

    mysql> USE yeast;
    Database changed

The following command will print a list of tables, along with the number of records and other statistics:

    mysql> SHOW TABLE STATUS;

Since you have a new database, there shouldn’t be any tables yet. Type quit to exit.

    mysql> quit;
    Bye

3.4 Configuration Files

When PIP is accessing your project database you are going to want it to be able to connect automatically; you don’t want to have to be around to type in MySQL passwords each time PIP connects to the database. MySQL allows users to put connection parameters (host name, user name, and password) in a configuration file. Currently PIP only uses the default configuration file, which is named .my.cnf and resides in your Unix home directory. Future versions may also allow project-specific configuration files in the project directories.

Exercise: To enable PIP to make connections automatically, create a new file named .my.cnf and put it in your home directory. Enter exactly these four lines (using your own host, user, and password):

    [client]
    host = teleost.cs.uoregon.edu
    user = conery
    password = .....

To keep your password private, set the Unix permissions so only you can read the file:

    % chmod go-rx .my.cnf
    % ls -al .my.cnf
    -rw------- 1 conery conery 65 24 Jan 10:10 .my.cnf

Test the connection file by connecting to the server again; this time you should be connected without a prompt for your password:

    % mysql
    mysql>

You can also specify the name of the database to use on the command line, so you don’t have to type USE yeast as the first command:

    % mysql yeast
    mysql> show table status;
    ...

Other configuration parameters can be specified in your .my.cnf file as well; see the MySQL documentation for more information.
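If you would rather script this exercise than type the file by hand, the following Python sketch shows one way to write a [client] section and restrict the file to its owner in a single step. It is only an illustration: the function name write_client_config and the placeholder password are hypothetical, and the demonstration writes into a temporary directory instead of your real home directory. Passwords containing characters such as # may need quoting in a real MySQL option file, which this sketch does not handle.

```python
import configparser
import os
import stat
import tempfile


def write_client_config(path, host, user, password):
    """Write a [client] section and restrict the file to owner read/write."""
    config = configparser.ConfigParser()
    config["client"] = {"host": host, "user": user, "password": password}
    # Create the file with mode 0600 from the start so the password is
    # never readable by other users, even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        config.write(handle)
    os.chmod(path, 0o600)  # also tighten a file that already existed


# Demonstration in a temporary directory, not the real home directory.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, ".my.cnf")
write_client_config(
    demo_path, "teleost.cs.uoregon.edu", "conery", "example-password"
)

# Read the file back to confirm its permissions and contents.
mode = stat.S_IMODE(os.stat(demo_path).st_mode)
check = configparser.ConfigParser()
check.read(demo_path)
print(oct(mode), dict(check["client"]))
```

Creating the file with restrictive permissions, rather than writing it first and tightening it afterwards, avoids a short window in which the password would be readable by other users.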

4 A Template for PIP Projects

4.1 Project Home

The next thing you need to do is establish a home directory for the project. When you want to execute steps of the workflow, you will typically cd to this directory and run PIP. When PIP launches Unix programs to carry out the steps, those programs will begin execution in this directory.

The command file that contains the workflow specification is named Pipfile by default. It should be stored in the project home directory.

4.2 Directory Structure

PIP expects to find three subdirectories under the top level project home directory. They can have any names, but the following are suggested:

bin
    This directory will hold any scripts you write for this project. PIP adds this directory to your search path so Unix commands you put in the bodies of rules can refer to the commands directly (more on this in the sections that deal with writing rule bodies).

mysql
    This directory will hold table definitions. For example, the workflow step named genes that will be developed for step 4 of this tutorial will produce a set of gene descriptions, and they will go in a table that is also named genes. The MySQL commands that create this table will go in a file named genes.sql in the mysql subdirectory.

stages
    If you are going to implement your own Stage classes, the Perl code for these classes will go in this directory.

Projects can create additional directories as they are needed. For example, the BLAST step in this tutorial will create a new directory named blast and put the BLAST database in that directory.

4.3 Template

An easy way to initialize a new project is to download a project template from the PIP web site. Download a copy of template.tgz, move it to the Unix directory where you want to expand it, and then uncompress and untar the file. You will get a new directory named template. Rename the directory to whatever you want, and you’re all set.

Exercise: Download a copy of template.tgz and use it to create the project home directory for the yeast tandem duplicates project:

    % cd ~/projects
    % mv ~/Downloads/template.tgz .
    % tar zxvf template.tgz
    % mv template yeast
    % cd yeast
    % ls
    Pipfile bin/ mysql/ stages/

5 Structure of a Pipfile

5.1 Pipfile Organization

The rules that define each of the project steps are collected into a single file, by default named Pipfile. When you start PIP, it will read this file, determine which database tables need to be updated, and launch the necessary applications to update those tables.

Users who are familiar with the Unix make utility will be familiar with the rule syntax and organization of a Pipfile:

• PIP ignores any characters following a #; use this character to introduce comments.
• You can define values for variables that will be used throughout the Pipfile. Using variables makes it easier to maintain projects; for example, the name of the project database is likely to be used often in the body of rules, and if you define it as a variable name and then use the variable in the rule bodies it will be easier to change later (or to copy the rule to a new Pipfile that has a similar rule).
• Workflow rules have a rule header consisting of a table name followed by a colon, and then the names of other tables this table depends on. Lines following the header are indented and can have either Unix commands or invocations of Stage objects (these will be explained in more detail in a later section of the tutorial).
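To make the make-like scheduling idea concrete, here is a minimal Python sketch of how a set of "table: dependencies" rules and table timestamps could determine which tables need rebuilding. It is not PIP's actual algorithm, which this tutorial does not spell out; the function name out_of_date, the table name chromosomes, and the numeric timestamps are hypothetical, and the sketch assumes every table appears in the rule set and that the dependencies contain no cycles.

```python
from typing import Dict, List


def out_of_date(rules: Dict[str, List[str]], updated: Dict[str, float]) -> List[str]:
    """Return the tables that need rebuilding, dependencies first.

    rules maps each table to the tables it depends on; updated maps each
    existing table to its last-update time. A table is stale if it does
    not exist yet, if any dependency is stale, or if any dependency was
    updated more recently than the table itself.
    """
    stale: List[str] = []
    state: Dict[str, bool] = {}  # table -> True if stale (memoized)

    def visit(table: str) -> bool:
        if table in state:
            return state[table]
        deps = rules.get(table, [])
        dep_stale = [visit(dep) for dep in deps]  # check dependencies first
        if table not in updated:
            is_stale = True  # table has never been built
        else:
            is_stale = any(dep_stale) or any(
                updated.get(dep, float("-inf")) > updated[table] for dep in deps
            )
        state[table] = is_stale
        if is_stale:
            stale.append(table)
        return is_stale

    for table in rules:
        visit(table)
    return stale


# Hypothetical rule set: chromosomes depends on source, genes on chromosomes.
rules = {"source": [], "chromosomes": ["source"], "genes": ["chromosomes"]}
# source was updated after chromosomes was built; genes does not exist yet.
updated = {"source": 200.0, "chromosomes": 100.0}

stale = out_of_date(rules, updated)
print(stale)
```

Here chromosomes is rebuilt because source is newer than it, and genes is built because it does not exist yet; source itself is left alone.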

5.2 Standard Variable Definitions

All Pipfiles must define values for the three variables shown in the first part of Table 5.2. By convention, variable names in Pipfiles (and Makefiles) are written with all capital letters (but PIP allows you to use lower case or mixed case).

Exercise: Create a new Pipfile for your project, and initialize it with variable definitions. If you downloaded the project template there is already a Pipfile in your project directory, and all you need to do is change the “XXX” in the definition of the database name to the name of your database.

    PROJECT = .                 # project home directory
    DB      = yeast             # MySQL database name
    TABLES  = $PROJECT/mysql    # find .sql files here
    PATH    = $PROJECT/bin      # find scripts here
    INC     = $PROJECT/stages   # find modules here
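The variable definitions above follow a simple pattern: a name, an equals sign, a value that may refer to an earlier variable with a $ prefix, and an optional # comment. The Python sketch below shows how such lines could be read and expanded. It illustrates the notation only and is not PIP's parser; the function name parse_variables is hypothetical, and the sketch handles variable lines only, not rules.

```python
from string import Template

PIPFILE_TEXT = """\
PROJECT = .                 # project home directory
DB      = yeast             # MySQL database name
TABLES  = $PROJECT/mysql    # find .sql files here
PATH    = $PROJECT/bin      # find scripts here
INC     = $PROJECT/stages   # find modules here
"""


def parse_variables(text):
    """Parse NAME = value lines, expanding $NAME references defined earlier."""
    variables = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if "=" not in line:
            continue  # blank line or something other than a definition
        name, value = (part.strip() for part in line.split("=", 1))
        # safe_substitute leaves unknown $NAME references untouched.
        variables[name] = Template(value).safe_substitute(variables)
    return variables


variables = parse_variables(PIPFILE_TEXT)
print(variables)
```

With PROJECT set to the current directory, TABLES expands to ./mysql, PATH to ./bin, and INC to ./stages.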

6 Running PIP

6.1 PIP Command

The easiest way to run PIP is to cd to the project directory and just type the name of the program:

    % pip

By default PIP looks for a file named Pipfile, reads the variable definitions and workflow rules, and then begins executing rules.

Exercise: Go to your project directory and run PIP on your current Pipfile. There are no rules yet in your Pipfile, but running PIP on an empty rule file is a good way to test your connections to the project database. If everything is working, you will see something like the following in your terminal window:
