fetch (fetch.pl), home (fetch: Free Extraction Tool for Computational Humanoid), CopyLeft (L)
Scientific (bioinformatics) software for Linux.
Version: 1.3
fetch is a simple, text-based utility for retrieving protein sequences
from Swissprot. It is available as a Perl script and as a Linux binary
built with the Perl compiler. (If you want binaries for other operating
systems built from C code, ask us, but Linux is the fastest and it is free,
so it would be better to get a copy for yourself for the future.) Our usual
charge for commercial use is $2,000,000,000 for any of the programs.
However, if you support GNU-flavoured free software, it is free.
It can extract any sequences you want, as long as they are in a sequence
database (primarily Swissprot). Swissprot is distributed as a file called
'seq.dat' (around 100 MB as of Oct. 1996). For speed, fetch creates an
index file (seq.idx) of the entries.
You can create the index file with fetch itself (no helper program is
needed) by using the -c option in any directory. As long as the index file
is in the Swissprot database directory pointed to by the SWISS or SWDIR
environment setting, or in the location pointed to by SWINDEX, you can use
fetch. You can also specify the files at the prompt.
The Perl index file is much smaller than C-generated binary index files,
so it takes up very little disk space.
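The layout of seq.idx is not described here, so the following is only a sketch of the general idea behind such an index, written in Python rather than Perl. It assumes the standard Swissprot flat-file layout, in which every entry starts with an "ID" line and ends with a "//" line, and it stores the byte offset of each entry so that a single entry can be read without scanning the whole file. The function names and the two sample entries are made up for illustration.

```python
import io

def build_index(handle):
    """Map each entry name to the byte offset of its 'ID' line."""
    index = {}
    while True:
        offset = handle.tell()          # position before reading the line
        line = handle.readline()
        if not line:                    # end of file
            break
        if line.startswith(b"ID "):
            name = line.split()[1].decode("ascii")
            index[name] = offset
    return index

def read_entry(handle, offset):
    """Return one entry as text, from its 'ID' line up to and including '//'."""
    handle.seek(offset)
    lines = []
    for line in iter(handle.readline, b""):
        lines.append(line.decode("ascii"))
        if line.startswith(b"//"):
            break
    return "".join(lines)

# Simplified, made-up two-entry database (not real Swissprot records).
SAMPLE = (
    b"ID   AAAA_HUMAN     STANDARD;      PRT;     4 AA.\n"
    b"SQ   SEQUENCE   4 AA;\n"
    b"     MKVL\n"
    b"//\n"
    b"ID   BBBB_YEAST     STANDARD;      PRT;     3 AA.\n"
    b"SQ   SEQUENCE   3 AA;\n"
    b"     MST\n"
    b"//\n"
)

index = build_index(io.BytesIO(SAMPLE))
entry = read_entry(io.BytesIO(SAMPLE), index["BBBB_YEAST"])
print(entry)
```

With a real database the same two functions would be called on a file opened in binary mode, and the dictionary would be written to disk once so that later lookups only need a seek.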
fetch can also make an index file for any FASTA-format sequence database.
In that case you can set the FASTADB and FASTAINDEX environment settings
to point to the files. These functions only serve to speed up sequence
retrieval.
If you want to find some sequences in a proprietary FASTA sequence
database, you can give the files to fetch at the prompt, for example:
fetch -af MY_FASTA.fa MY_FASTA.fa.idx. In fact, you do not need to specify
the -f option, as fetch can determine the format automatically. Also, if
you have already created the FASTA index file with fetch -c MY_FASTA.fa,
you can give either of the MY_FASTA files (index or sequence), as fetch
will look in the directory where the first file sits.
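How fetch decides on the format is not stated. One simple way such a check could work, shown here as a Python sketch under that assumption, is to look at the first non-blank line: FASTA records begin with ">", while Swissprot entries begin with an "ID" line. The function name and the sample texts are hypothetical.

```python
import io

def guess_format(handle):
    """Guess 'fasta' or 'swissprot' from the first non-blank line of a text handle."""
    for line in handle:
        stripped = line.strip()
        if not stripped:
            continue                    # skip leading blank lines
        if stripped.startswith(">"):
            return "fasta"
        if stripped.startswith("ID "):
            return "swissprot"
        return "unknown"                # first real line matches neither format
    return "unknown"                    # empty input

# Made-up sample inputs.
fasta_text = ">seq1 made-up example\nMKVLAA\n"
swiss_text = "ID   AAAA_HUMAN     STANDARD;      PRT;     4 AA.\n//\n"

print(guess_format(io.StringIO(fasta_text)))
print(guess_format(io.StringIO(swiss_text)))
```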
Example)
- If you downloaded fetch.pl, you can rename it to fetch or whatever you like.

============= SETTING UP ==========================
- Suppose your Swissprot file is in /usr/db/swiss as 'seq.dat'.
- Your Swissprot path environment variable is set to SWISS=/usr/db/swiss.
- To make an index file for first-time use, run fetch -c /usr/db/swiss/seq.dat
  from any directory. If you are already in /usr/db/swiss, you can run
  fetch -c seq.dat instead (this is recommended).
- The step above creates 'seq.idx'.
- You can copy 'seq.idx' to the Swissprot directory (/usr/db/swiss), or set
  the SWINDEX environment variable to any path where you want to put seq.idx
  (say, /usr/agb/db/temp/).
============== USE OF FETCH =======================
- Now the setup is finished. fetch requires only two files (seq.dat and seq.idx).
- ex1) fetch -a HUMAN : fetches all the Swissprot sequences from human (to
  STDOUT). This shows the normal full Swissprot entries.
- ex2) fetch -a -f HUMAN : fetches all the Swissprot sequences from human,
  but in FASTA format (to STDOUT).
- ex3) fetch *HUMAN : same as ex1.
- ex4) fetch -f *HUMAN : same as ex2.
- ex5) fetch YAKO_YEAST : fetches the single sequence YAKO_YEAST.
- ex6) fetch -f YAKO_YEAST : same as ex5, but in FASTA format.
- ex7) fetch -l *YEAST : lists all the matches with YEAST in their names,
  in one-column format.
- ex8) fetch -g YAKO_YEAST : fetches YAKO_YEAST in GDF file format.
- ex9) fetch -g *YEAST : guess what this should do :-)
- ex10) fetch C*YEAST : fetches all the yeast sequences that have C in their
  names. This fetches entries like COXW_YEAST as well as YCW2_YEAST, ...
- ex11) fetch -f C*YEAST : same as ex10, but in FASTA format.
- ex12) fetch -af YEAST s=100 S=200 n=5 : gets, in FASTA format, the first 5
  sequences found that have YEAST in their names and whose sizes are between
  100 and 200.
- NOTE: When you use a glob (*), there should not be files in the current
  directory whose names match the pattern. If there are files called xxxYEAST
  in your current directory and you search for *YEAST, you might get the wrong
  result. To get around this, put the pattern in single quotes (' ') to tell
  the Linux or Unix shell that it is not a file name pattern. So you can run
  fetch 'C*YEAST' safely.
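The examples above suggest that a pattern is matched anywhere in the entry name: C*YEAST picks up YCW2_YEAST as well as COXW_YEAST, although the former does not start with C. The Python sketch below reproduces that behaviour by turning each * into "any run of characters" and searching without anchoring. This is inferred from the examples only and is not taken from the fetch source; the function names and the sample name list are illustrative.

```python
import re

def pattern_to_regex(pattern):
    """Turn a fetch-style pattern into an unanchored regex ('*' = any characters)."""
    parts = [re.escape(part) for part in pattern.split("*")]
    return re.compile(".*".join(parts))

def match_names(pattern, names):
    """Return the names in which the pattern occurs anywhere."""
    regex = pattern_to_regex(pattern)
    return [name for name in names if regex.search(name)]

# Small sample list of entry names, for illustration.
names = ["COXW_YEAST", "YCW2_YEAST", "YAKO_YEAST", "ALBU_HUMAN"]

print(match_names("C*YEAST", names))   # ['COXW_YEAST', 'YCW2_YEAST']
print(match_names("*YEAST", names))    # ['COXW_YEAST', 'YCW2_YEAST', 'YAKO_YEAST']
print(match_names("HUMAN", names))     # ['ALBU_HUMAN']
```

Matching of this kind happens inside the program, so the pattern still has to reach it unchanged, which is why quoting it on the command line matters.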
All the options:
  -h  : help
  -c  : create index file (seq.idx)
  -f  : FASTA file format output
  -l  : list matched names in Swissprot
  -g  : output in GDF format
  -a  : all possible matches (globbing)
  -af : all matches, with FASTA format output
  -s  : for specifying species (if you say HUMAN, it will fetch all human
        proteins, but if you say RAT, it will fetch ARATH as well; the -s
        option prevents getting ARATH and fetches only RAT)
  n=  : number of sequences to fetch in FASTA format output (e.g. n=100 at
        the prompt); this option automatically sets the -f option
  s=  : the smallest sequence size (e.g. s=10, to get sequences of at least
        10 aa); this option automatically sets the -f option
  S=  : the largest sequence size (e.g. S=1000, to get sequences of less than
        1000 aa); this option automatically sets the -f option
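The effect of -s, s=, S= and n= can be pictured with the following Python sketch. It is an illustration of the behaviour described above, not the fetch implementation: without strict matching, RAT is found inside ARATH; with strict matching, the species code after the underscore must be exactly RAT. Treating both size bounds as inclusive is an assumption, since the text says both "between 100 and 200" and "less than 1000 aa". All names and sequences are made up.

```python
def species_matches(name, species, strict):
    """strict=True mimics the described -s option: exact species code after '_'."""
    if strict:
        return name.rsplit("_", 1)[-1] == species
    return species in name              # plain substring match, so RAT hits ARATH

def select(records, species, strict=False, smallest=None, largest=None, limit=None):
    """Return names of the first `limit` records that pass the species and size filters."""
    hits = []
    for name, seq in records:
        if limit is not None and len(hits) >= limit:
            break                       # like n=: stop after enough hits
        if not species_matches(name, species, strict):
            continue
        if smallest is not None and len(seq) < smallest:
            continue                    # like s=: too short
        if largest is not None and len(seq) > largest:
            continue                    # like S=: too long (inclusive bound assumed)
        hits.append(name)
    return hits

# Made-up records: (entry name, sequence).
records = [
    ("AAAA_RAT", "M" * 150),
    ("BBBB_ARATH", "M" * 120),
    ("CCCC_RAT", "M" * 50),
    ("DDDD_RAT", "M" * 180),
]

print(select(records, "RAT"))
print(select(records, "RAT", strict=True))
print(select(records, "RAT", strict=True, smallest=100, largest=200, limit=5))
```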
Bug reports and enhancement requests are welcome (jhp20@cus.cam.ac.uk).
License Policy
- CopyLeft (L), but I support GPL policy as well.
Jong Park,