PIP Tutorial
John Conery
April 26, 2005
The pipeline interface program (PIP) is a workflow system for managing complex projects. Although it was designed primarily for bioinformatics projects, the system should work well for any project that manages a large number of applications that are invoked by Unix command lines.
PIP uses a rule-based workflow paradigm. Each individual step in the workflow is defined by a rule that tells the system what inputs are required by the step, the application(s) to run to execute the step, and the outputs produced by the step. The initial inputs, intermediate work products, and final outputs are all stored in a database, and PIP uses timestamps on database tables to automatically schedule the workflow steps.
This tutorial will explain how to create a PIP workflow by going through the steps in the development of a project that downloads yeast chromosomes from NCBI and searches for pairs of genes that may be recent tandem duplicates. The first section is a project overview. Remaining sections show how to set up the database connection and initial PIP file, and then how to add rules to the workflow in order to implement each step.
To do all the steps in the tutorial you will need to have Perl installed on your workstation, along with the Bio::Perl library and CPAN modules for accessing a MySQL database and downloading files via FTP (see the Software Environment section for details). It is possible to do the tutorial if you do not have all the necessary Perl modules; at the end of the description of each step there are instructions for how to bypass that step and load data directly into your database so you can continue with later steps.
This document is a PDF file with embedded hyperlinks. If you are reading it on-line with Adobe Acrobat Reader or the Mac OS X Preview application, hyperlinks will appear in red letters. You can use these links to connect to web pages (e.g. for the Perl libraries described in the Software Environment section) or download files (e.g. scripts mentioned throughout the tutorial). Blue-colored text corresponds to hyperlinks within this document, e.g. you can click on a section name in the table of contents to go directly to that section.
Contents
1 Project Overview: Searching for Tandem Duplicates
    1.1 Tandem Duplications
    1.2 Workflow
2 Software Environment
    2.1 Perl
    2.2 MySQL
    2.3 CPAN Modules
3 Database Connections
    3.1 MySQL
    3.2 Project Database
    3.3 Accessing the Database from the Command Line
    3.4 Configuration Files
4 A Template for PIP Projects
    4.1 Project Home
    4.2 Directory Structure
    4.3 Template
5 Structure of a Pipfile
    5.1 Pipfile Organization
    5.2 Standard Variable Definitions
6 Running PIP
    6.1 PIP Command
    6.2 Command Line Options
7 Project Status
8 Make a List of Yeast Chromosomes
    8.1 startup Script
    8.2 source Table
    8.3 PIP Rule for source
    8.4 Top Level Rules
9 Download the Chromosomes from NCBI
    9.1 Query
    9.2 Script
    9.3 Table
    9.4 Rule
    9.5 Test the PIP Rule
10 Make Chromosome Descriptions
    10.1 Feature Parser
    10.2 Wrapper
    10.3 Table
    10.4 Rule
11 Extract Gene Sequences
    11.1 Script
    11.2 Table
    11.3 Rule
12 Use BLAST to Find Duplicates
    12.1 Specification
    12.2 Preparation
    12.3 Stage Class
    12.4 Blast Class
    12.5 Putting it All Together
13 Reciprocal Best Hits
    13.1 Blast Results
    13.2 Reciprocal Hits
    13.3 Rule
14 Tandem Duplicates
15 Wrapping Up
1 Project Overview: Searching for Tandem Duplicates
1.1 Tandem Duplications
A tandem duplication is a mutation that occurs during meiosis. When the parent cell’s chromosomes separate, they often break and recombine so that chromosomes in daughter cells have genes from two different parent chromosomes. If chromosomes break at different points, one daughter cell may end up with two copies of the same gene and the other daughter cell will be lacking the gene.
If the mutation for the extra gene is not lethal, the individuals that carry it may pass the two copies on to successive generations. Eventually, either from genetic drift or because the double copy provides a selective benefit, the new gene may be fixed in the population and become a permanent part of the organism’s genome.
In this project we will look for possible instances of recent tandem duplications in brewer’s yeast, Saccharomyces cerevisiae. We will take a very simple-minded approach and look for pairs of genes that are highly similar and located very close to each other on the same chromosome.
This project is a nice exercise in building a rule-based pipeline, but isn’t a very realistic example of a true search for tandem duplicates. The fact that two genes are similar and close to each other doesn’t mean they are tandem duplicates (we won’t add any criteria to say how similar or how close they must be) and we may not find all recent tandem pairs (one may have been relocated soon after duplication).
1.2 Workflow
The steps in the project workflow are:

1. Build a table with a list of yeast chromosomes and web addresses of the FTP server at NCBI from which we can download annotated genome files for each chromosome.
2. Download the genome files. We will be fetching Genbank reports, which are complete descriptions of the chromosomes, including all the genes.
3. Scan the reports to get chromosome information, including chromosome names and sizes.
4. Run a “feature parser” on each report to pull out the complete set of genes from each chromosome.
5. Create a BLAST database from the full set of genes, and then do an all-vs-all BLAST search where we compare each yeast gene against all the others.
6. A common technique for identifying two genes that are most closely related to each other is to look for “reciprocal best hits”, that is, genes X and Y such that Y is the best BLAST hit for X and vice versa.
7. Filter the set of reciprocal best hits to find pairs that are on the same chromosome and within 10K bases of each other (regardless of whether there are any other genes in between).

Steps 2 through 4 do a bit more work than is necessary, since we can fetch sets of genes from NCBI instead of downloading full chromosome sequences and then scanning the chromosome files to pick out the genes. We chose the full-chromosome approach for several reasons:

- The feature parser uses a BioPerl library routine to do the parsing. BioPerl is a very useful library, and we wanted to show how to use it in a PIP pipeline.
- The BioPerl routine works on any format, not just the NCBI Genbank Report format, so this version of the pipeline will be able to use data from other sources.
- When we run the feature parser we have more control over the type of information printed for each gene; if we download the gene files we have to rely on parsing the deflines in the FASTA files to get gene names, locations, etc.
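Steps 6 and 7 can be sketched with a few lines of shell and awk before any PIP rules exist. This is only an illustration under stated assumptions, not the scripts developed later in the tutorial: the function names, the tab-separated file layouts ("query, subject, score" for hits and "gene, chromosome, start" for gene positions), and the use of start positions to measure distance are all hypothetical.

```shell
#!/bin/sh
# Sketch only: file layouts and function names are assumptions, not PIP's.

# Print reciprocal best hits from a tab-separated file of
# "query <TAB> subject <TAB> score" lines. Self hits are ignored and,
# on a tied score, the first hit seen is kept.
reciprocal_best_hits() {
    awk -F'\t' '
        $1 != $2 && (!($1 in best) || $3 + 0 > score[$1]) {
            best[$1] = $2
            score[$1] = $3 + 0
        }
        END {
            for (x in best) {
                y = best[x]
                # Report each pair once, with the smaller name first.
                if ((y in best) && best[y] == x && x < y)
                    print x "\t" y
            }
        }
    ' "$1" | sort
}

# Keep pairs ($1) that lie on the same chromosome and whose start positions
# are at most $3 bases apart (default 10000). The genes file ($2) holds
# "gene <TAB> chromosome <TAB> start" lines.
tandem_candidates() {
    awk -F'\t' -v max="${3:-10000}" '
        FILENAME == ARGV[1] { chr[$1] = $2; pos[$1] = $3 + 0; next }
        ($1 in chr) && ($2 in chr) && chr[$1] == chr[$2] {
            d = pos[$1] - pos[$2]
            if (d < 0) d = -d
            if (d <= max) print $1 "\t" $2 "\t" d
        }
    ' "$2" "$1"
}
```

With a hits file and a genes file in those layouts, `reciprocal_best_hits hits.txt > pairs.txt` followed by `tandem_candidates pairs.txt genes.txt` prints each candidate pair with the distance between the two genes.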
2 Software Environment
2.1 Perl
PIP is written in Perl. It was developed and tested with Perl 5.8, but will probably run under 5.6. PIP requires the following CPAN modules to be installed on the system where it runs:

    Getopt::Long
    File::stat
    FileHandle
    DBI
    DBD::mysql
2.2 MySQL
PIP was developed and tested on MySQL 4.0, but earlier versions that support a SHOW TABLE STATUS command should also work. PIP uses the DBI interface for only a few operations (mainly checking timestamps on tables); most other operations are done via system calls that invoke the mysql command line tool.
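To see the kind of timestamp that SHOW TABLE STATUS reports, you can ask the mysql command line tool for vertical output and pick out the Update_time field. The sketch below is one way to do that from the shell; the function names are hypothetical, and it is not necessarily how PIP itself reads timestamps (PIP uses DBI for that). Depending on the storage engine, Update_time may be reported as NULL.

```shell
#!/bin/sh
# Sketch only: hypothetical helper names, not part of PIP.

# Extract the Update_time value from vertical-format SHOW TABLE STATUS
# output read on standard input. Prints "NULL" if the server reports NULL.
parse_update_time() {
    awk '$1 == "Update_time:" {
        print ($3 == "" ? $2 : $2 " " $3)
        exit
    }'
}

# Print the last update time of table $2 in database $1.
# Requires the mysql client and a reachable server.
table_update_time() {
    mysql --vertical -e "SHOW TABLE STATUS LIKE '$2'" "$1" | parse_update_time
}
```

For example, `table_update_time yeast genes` would print the update time of a table named genes in the yeast database, once that table exists.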
2.3 CPAN Modules
The scripts used in the tutorial use the following CPAN modules:

    Workflow Step    Modules Used
    download         Net::FTP
    fp               Bio::SeqIO (see also bioperl.org)
If you do not have these modules installed you can still run the tutorial – you can download a table description and table data using links provided in that section of the tutorial, create the table, and load the data. As long as PIP sees the table exists it can run later steps in the workflow.
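A quick way to find out which of the modules named in this section are already installed is to ask Perl to load each one. The helper below is a small sketch (the function name is made up); it relies only on the standard `perl -MModule -e 1` idiom, which exits with an error when a module cannot be loaded.

```shell
#!/bin/sh
# Report whether each Perl module named on the command line can be loaded.
# check_perl_modules is a hypothetical helper, not part of PIP.
check_perl_modules() {
    for module in "$@"; do
        if perl -M"$module" -e 1 >/dev/null 2>&1; then
            echo "ok       $module"
        else
            echo "MISSING  $module"
        fi
    done
}

# Modules required by PIP itself, then those used by the tutorial scripts.
check_perl_modules Getopt::Long File::stat FileHandle DBI DBD::mysql
check_perl_modules Net::FTP Bio::SeqIO
```

Any module reported as MISSING either needs to be installed from CPAN or, for the tutorial scripts, can be worked around by loading the table data directly as described above.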
3 Database Connections
3.1 MySQL
PIP currently requires projects to use a MySQL database. Future versions may allow connections to other types of relational databases, but for now you need to have access to a MySQL database.
MySQL uses a client-server software architecture. In this type of system, the server is a program that runs continually on a system and accepts requests from clients, which may or may not be on the same system.
When you create a project database, you can access it as you would any other MySQL database, using the mysql command line interface or a client such as CocoaMySQL on Mac OS X or any of a wide variety of MySQL client programs. As far as the database server is concerned, PIP is just another client. What that means for your projects is you can use PIP to run steps in the workflow, but at any time you can access the server to check on the status of the project, or to examine outputs produced by steps that have completed.
3.2 Project Database
The first step in starting up a new project to be managed by PIP is to decide where the project database will reside. You need to select a server, which can either be on your own workstation or laptop, or on any remote system your machine can connect to. You need to have a MySQL user account on that server. If you work in a lab or research group with a shared server you may not be able to create your own new database, in which case you need to ask the MySQL administrator to create one. Once the database is created, your access permissions need to be set so you can create new tables in that database.
Exercise: Throughout the tutorial, we’ll use the example of a MySQL server running on teleost.cs.uoregon.edu. The example database will be named yeast, and the user name will be conery. You should of course replace these names with the names of your server, database, and account when you type the commands.
3.3 Accessing the Database from the Command Line
To verify you can connect to the project database, use mysql, the Unix program that implements a command line interface to a MySQL database. You can run this program on a laptop or workstation by using the -h option to tell it to connect to a remote server. Specify your MySQL user name with the -u option.
Exercise: Open a connection to the server that will host your tandem duplicate database.

    % mysql -h teleost.cs.uoregon.edu -u conery
    Enter password: .....
    mysql>

When you see the mysql> prompt it means you have connected successfully and you can now enter queries. The first thing is to list the databases and see if your database has been created; you should see something like this:

    mysql> SHOW DATABASES;
    +----------+
    | Database |
    +----------+
    | mysql    |
    | test     |
    | trna     |
    | yeast    |
    +----------+

Now make sure you can use your database:

    mysql> USE yeast;
    Database changed

The following command will print a list of tables, along with the number of records and other statistics:

    mysql> SHOW TABLE STATUS;

Since you have a new database, there shouldn’t be any tables yet. Type quit to exit.

    mysql> quit;
    Bye
3.4 Configuration Files
When PIP is accessing your project database you are going to want it to be able to connect automatically; you don’t want to have to be around to type in MySQL passwords each time PIP connects to the database. MySQL allows users to put connection parameters (host name, user name, and password) in a configuration file. Currently PIP only uses the default configuration file, which is named .my.cnf and resides in your Unix home directory. Future versions may also allow project-specific configuration files in the project directories.
Exercise: To enable PIP to make connections automatically, create a new file named .my.cnf and put it in your home directory. Enter exactly these four lines (using your own host, user, and password):

    [client]
    host = teleost.cs.uoregon.edu
    user = conery
    password = .....

To keep your password private, set the Unix permissions so only you can read the file:

    % chmod go-rx .my.cnf
    % ls -al .my.cnf
    -rw------- 1 conery conery 65 24 Jan 10:10 .my.cnf

Test the connection file by connecting to the server again; this time you should be connected without a prompt for your password:

    % mysql
    mysql>

You can also specify the name of the database to use on the command line, so you don’t have to type USE yeast as the first command:

    % mysql yeast
    mysql> show table status;
    ...

Other configuration parameters can be specified in your .my.cnf file as well; see the MySQL documentation for more information.
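If you prefer to script the creation of the configuration file, the sketch below writes the same four lines and sets owner-only permissions in one step. The helper name is hypothetical and the script is only one possible approach; note that it overwrites any existing .my.cnf in the target directory, and that a password typed at the terminal will be echoed unless you turn echo off first.

```shell
#!/bin/sh
# Sketch only: write_my_cnf is a hypothetical helper, not part of PIP or MySQL.

# Write a [client] section to $1/.my.cnf for host $2 and user $3.
# The password is read from standard input so it does not appear in the
# process list or on the command line. An existing file is overwritten.
write_my_cnf() {
    (
        umask 077                      # new files are readable by the owner only
        IFS= read -r password
        {
            printf '%s\n' "[client]"
            printf '%s\n' "host = $2"
            printf '%s\n' "user = $3"
            printf '%s\n' "password = $password"
        } > "$1/.my.cnf"
        chmod 600 "$1/.my.cnf"         # also tightens a file that already existed
    )
}
```

A typical call would be `write_my_cnf "$HOME" teleost.cs.uoregon.edu conery`, followed by typing the password and pressing return.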
4 A Template for PIP Projects
4.1 Project Home
The next thing you need to do is establish a home directory for the project. When you want to execute steps of the workflow, you will typically cd to this directory and run PIP. When PIP launches Unix programs to carry out the steps, those programs will begin execution in this directory.
The command file that contains the workflow specification is named Pipfile by default. It should be stored in the project home directory.
4.2 Directory Structure
PIP expects to find three subdirectories under the top level project home directory. They can have any names, but the following are suggested:

    bin      This directory will hold any scripts you write for this project. PIP adds this directory to your search path so Unix commands you put in the bodies of rules can refer to the commands directly (more on this in the sections that deal with writing rule bodies).
    mysql    This directory will hold table definitions. For example, the workflow step named genes that will be developed for step 4 of this tutorial will produce a set of gene descriptions, and they will go in a table that is also named genes. The MySQL commands that create this table will go in a file named genes.sql in the mysql subdirectory.
    stages   If you are going to implement your own Stage classes, the Perl code for these classes will go in this directory.

Projects can create additional directories as they are needed. For example, the BLAST step in this tutorial will create a new directory named blast and put the BLAST database in that directory.
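The suggested layout can also be created by hand. The following sketch builds the three subdirectories and an empty Pipfile under a directory of your choice; the function name is made up for this example, and the directory names are simply the suggested ones listed above.

```shell
#!/bin/sh
# Sketch only: make_pip_project is a hypothetical helper, not part of PIP.

# Create the suggested project layout under directory $1.
# An existing Pipfile is left untouched.
make_pip_project() {
    mkdir -p "$1/bin" "$1/mysql" "$1/stages" || return 1
    [ -e "$1/Pipfile" ] || : > "$1/Pipfile"
}
```

For example, `make_pip_project ~/projects/yeast` creates the yeast project directory with bin, mysql, and stages subdirectories and an empty Pipfile that you can then fill in.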
4.3 Template
An easy way to initialize a new project is to download a project template from the PIP web site. Download a copy of template.tgz, move it to the Unix directory where you want to expand it, and then uncompress and untar the file. You will get a new directory named template. Rename the directory to whatever you want, and you’re all set.
Exercise: Download a copy of template.tgz and use it to create the project home directory for the yeast tandem duplicates project:

    % cd ~/projects
    % mv ~/Downloads/template.tgz .
    % tar zxvf template.tgz
    % mv template yeast
    % cd yeast
    % ls
    Pipfile bin/ mysql/ stages/
5 Structure of a Pipfile
5.1 Pipfile Organization
The rules that define each of the project steps are collected into a single file, by default named Pipfile. When you start PIP, it will read this file, determine which database tables need to be updated, and launch the necessary applications to update those tables.
Users who are familiar with the Unix make utility will be familiar with the rule syntax and organization of a Pipfile:

- PIP ignores any characters following a #; use this character to introduce comments.
- You can define values for variables that will be used throughout the Pipfile. Using variables makes it easier to maintain projects; for example, the name of the project database is likely to be used often in the body of rules, and if you define it as a variable name and then use the variable in the rule bodies it will be easier to change later (or to copy the rule to a new Pipfile that has a similar rule).
- Workflow rules have a rule header consisting of a table name followed by a colon, and then the names of other tables this table depends on. Lines following the header are indented and can have either Unix commands or invocations of Stage objects (these will be explained in more detail in a later section of the tutorial).
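The make-like scheduling idea can be illustrated without a database. The sketch below is not PIP's implementation: it takes a hypothetical file of "table update-time" pairs (times as plain integers) and a file containing rule headers in the "table: dependencies" form described above, and prints the tables that would need to be rebuilt. A table is considered stale if it is missing, if any table it depends on is newer, or if any table it depends on is itself stale. Rules are assumed to be listed so that each table appears after the tables it depends on.

```shell
#!/bin/sh
# Sketch only: an analogy for timestamp-driven scheduling, not PIP's code.
# $1 = file of "table update_time" lines (integer times, an assumed format)
# $2 = file with rule headers of the form "table: dep1 dep2 ..."
stale_tables() {
    awk '
        FILENAME == ARGV[1] { updated[$1] = $2 + 0; next }
        /^[ \t#]/ || $1 !~ /:$/ { next }    # skip rule bodies, comments, variables
        {
            target = $1
            sub(/:$/, "", target)
            stale = !(target in updated)    # a missing table must be built
            for (i = 2; i <= NF && !stale; i++)
                if (($i in rebuilt) || (($i in updated) && updated[$i] > updated[target]))
                    stale = 1
            if (stale) { print target; rebuilt[target] = 1 }
        }
    ' "$1" "$2"
}
```

Given update times for four tables and rules that chain them together, `stale_tables times.txt rules.txt` lists, in rule order, every table whose inputs changed after it was last written.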
5.2 Standard Variable Definitions
All Pipfiles must define values for the three variables shown in the first part of Table 5.2. By convention, variable names in Pipfiles (and Makefiles) are written with all capital letters (but PIP allows you to use lower case or mixed case).
Exercise: Create a new Pipfile for your project, and initialize it with variable definitions. If you downloaded the project template there is already a Pipfile in your project directory, and all you need to do is change the “XXX” in the definition of the database name to the name of your database.

    PROJECT = .                  # project home directory
    DB      = yeast              # MySQL database name
    TABLES  = $PROJECT/mysql     # find .sql files here
    PATH    = $PROJECT/bin       # find scripts here
    INC     = $PROJECT/stages    # find modules here
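When checking a Pipfile by eye it can help to list just the variable definitions with comments stripped. The helper below is a rough sketch with a made-up name; it assumes definitions have the "NAME = value # comment" shape shown above, does not expand references such as $PROJECT, and would truncate a value that itself contains a # character.

```shell
#!/bin/sh
# Sketch only: pipfile_vars is a hypothetical helper, not part of PIP.

# Print the variable definitions found in Pipfile $1 as NAME=value lines.
pipfile_vars() {
    awk '
        { sub(/#.*/, "") }                       # drop comments
        /^[A-Za-z_][A-Za-z0-9_]*[ \t]*=/ {
            name = $0
            sub(/[ \t]*=.*/, "", name)           # text before the "="
            value = $0
            sub(/^[^=]*=[ \t]*/, "", value)      # text after the "="
            sub(/[ \t]+$/, "", value)            # trim trailing blanks
            print name "=" value
        }
    ' "$1"
}
```

Running `pipfile_vars Pipfile` in the project directory prints one line per definition, which makes it easy to confirm that the database name has been changed from the template value.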
6 Running PIP
6.1 PIP Command
The easiest way to run PIP is to cd to the project directory and just type the name of the program:

    % pip

By default PIP looks for a file named Pipfile, reads the variable definitions and workflow rules, and then begins executing rules.
Exercise: Go to your project directory and run PIP on your current Pipfile. There are no rules yet in your Pipfile, but running PIP on an empty rule file is a good way to test your connections to the project database. If everything is working, you will see something like the following in your terminal window: