***********************************
Simple Sequence Manipulation Tools
(SISEQ)
***********************************
Copyright Notice
The copyright of the Siseq Package is owned by the author, Naoki Sato. This package can be distributed freely if it is accompanied by this manual and other documents. This package was developed for academic purposes. Any commercial use requires the permission of the author. Users are responsible for any consequences of the use of this package. The author is not responsible for any hardware trouble, damage of files, or loss of data, which might be caused by the use of this package.
Author: Naoki Sato
Department of Molecular Biology,
Faculty of Science,
Saitama University,
255 Shimo-Ohkubo, Urawa 338-8570
Japan
E-mail: naokisat@molbiol.saitama-u.ac.jp
Home page: http://www.molbiol.saitama-u.ac.jp/~naoki/
Copyright 1998-2000 Naoki Sato
Version 1.23 (partially japanized version)
March 5, 2000.
*****************************
Objectives
*****************************
This package has been developed to extract sequence information from large database entries and to prepare sequence data as input for various sequence analysis software. In our experience, commercial GUI-based programs such as Genetyx and DNASIS are not well suited to these purposes, because they need a large memory space and a large number of mouse clicks and, consequently, a long operation time. In addition, the SISEQ program package is intended to process multiple sequence files, either in the multiple FASTA format (as used for the input files of multiple sequence alignment) or as catenated database entries such as the GenBank and EMBL database releases. The programs in this package do not need a graphical user interface (although one is optionally available) and do not use a large amount of memory to process common sequence files, but they extract and manipulate sequence data rapidly.
See README.123 file for the description of the current version 1.2.3.
Notice
1. This software was developed primarily for my own research, but I put it in the public domain in the hope that it will be useful. I did my best to write code with as few bugs as possible, but I cannot guarantee that no command will freeze with a very unusual input file or unusual options on some particular machines. Special attention should be paid to the database file format, which has changed rather rapidly during the past years. In no case will the author of this software be responsible for any kind of hardware trouble, file damage, or loss of data, which might occur in the course of the use of this software. Users should use this software at their own risk.
2. If you want to use a database entry that has been downloaded from the network, be sure that the file does not contain extra spaces, extra return codes, etc. These will prevent the software from recognizing the file format. In this case, you will have to remove the extra lines or extra characters with an ordinary text editor such as Edit7 or YouEdit (for the Macintosh). To use a file that has been transferred from a different OS (Unix, Mac, Dos), the return codes must be corrected. Whether this is necessary depends on the software used for the transfer (FTP etc). To correct the return codes, use the "txtr" command in this software package. Type "txtr cr". On a Macintosh, double click on the txtr icon, and then type " cr".
3. Edit7 is available from http://www.bekkoame.or.jp/~iimori/sw/Edit7.html
4. MEMORY REQUIREMENTS. Although this program does not need an unusually large memory, it requires a fairly large memory to process a large database file. In my experience, about 90 MB of memory was needed to process a GenBank file of about 250 MB, and about 180 MB of memory for a 500 MB file. In Mac OS, set the memory size of the application to an appropriate value. If necessary, set up virtual memory in the control panel. In Win OS, use the binary provided with the distribution, which has been tested with a 500 MB file. In UNIX, prepare enough swap space. Since only a part of the real memory is available to the application, be sure to have a sufficient total size of (memory + swap) on your machine. Note that a database file with a large number of CDS (such as a bacterial genome file) needs a large memory (up to about 30 MB), since all data are held in memory during the processing of a single database entry (i.e., a sequence with a single ID).
5. CPU performance. To process a large database file, a UNIX system is recommended because of its speed. In my experience, the hum1.dat file (500 MB in EMBL format) was processed by the command "cdsnuc hum1.dat outfile s 0 e 0" with the SEQ_IMPORT option within 6 min on a SUN Ultra 10 with 256 MB memory, 200 MB swap and an UltraSparc II 333 MHz CPU. The same operation was completed within about 60 min on a portable Linux box with a Pentium 120 MHz CPU, 40 MB memory and 200 MB swap. It took about 25 min on a Silicon Graphics O2 (R5000, 180 MHz, 128 MB memory, 256 MB swap). On the other hand, it took two and a half hours on a PowerMac 7600/200 with 160 MB memory, or 45 min on Windows 98 with a Pentium II 350 MHz and 144 MB memory. This difference might reflect the use of swap, rather than the processing itself, and does not say anything about differences between operating systems. Nevertheless, a small genomic file such as an organellar or bacterial genome can be processed within a reasonable time on all platforms.
!!!!!!!!!! IMPORTANT !!!!!!!!!
Read the README.120 file before attempting to read the usage described below. To read a help message for individual commands, run the siseq program, then type h, and follow instructions.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
Notes on the usage of Macintosh or Windows version
1. This is not a simple GUI-based program that needs just double clicking and choosing menus. You should read the following explanation for the commands before trying to use the program. All the SISEQ commands are integrated into a single program "siseqFAT" (Macintosh) or "siseqW.exe" (Windows).
2. Put the file to be processed in the folder in which the program resides.
3. Double click on the program icon "siseqFAT". A command input window will appear.
4. Type an appropriate command and options, and then hit return. Japanese characters are not allowed in file names. Always use "roman" characters, not "hankaku". To use help, type h or help. In the help menu, type e or exit to return to the main menu.
5. A program log window will appear.
6. When the "Terminated" line appears, the program has terminated correctly. Close the window by typing the command + Q keys. Alternatively, choose exit from the menu.
7. The output file should be in the same folder. It is a text file with Edit7 as the creator. If you have installed Edit7 on your system, you can open the file by just a double-click. If not, you will have to open the file from a text editor such as Simple Text, YouEdit, or another word processor.
************************
Usage of commands
************************
In the following explanation, <infile> and <outfile> should be replaced by the name of the input file and the name of the output file, respectively.
txtr
txtr (You can select an option from the menu later.)
This is a general text transformer, not one specialized in converting sequence files. This command is mainly useful on Mac and Win OS, since equivalent common tools are already available on UNIX platforms. See the help menu for details.
1. Delete spaces from the text.
2. Delete spaces and isolated numbers from the text.
3. Delete a user-defined character from the text.
4. Uppercase to lowercase.
5. Lowercase to uppercase.
6. Delete empty lines.
7. Copy file.
CR or cr. Convert the return codes (LF or CR) to those of the current system.
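The return-code correction can be pictured with the following minimal Python sketch. It is not the txtr implementation; the function name normalize_newlines and its newline parameter are hypothetical, and the sketch simply assumes that DOS files end lines with CR+LF, classic Mac OS files with CR, and UNIX files with LF.

```python
def normalize_newlines(data: bytes, newline: bytes = b"\n") -> bytes:
    """Rewrite CRLF (DOS), CR (classic Mac OS) and LF (UNIX) line ends as `newline`."""
    # Handle CRLF first so that it is not turned into two line breaks.
    unified = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unified.replace(b"\n", newline)


# A file with mixed return codes, as can happen after a careless transfer.
mixed = b"ID   X\r\nAC   Y\rDE   Z\n"
print(normalize_newlines(mixed))           # UNIX line ends
print(normalize_newlines(mixed, b"\r\n"))  # DOS/Windows line ends
```

Working on bytes rather than text avoids any automatic newline translation by the operating system while the file is read.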
getseq2 n/c
getseq2 (You can type in the remaining options later.)
(Note: getseq in the DOS version)
This command extracts a part of the sequence from a sequence file. You must know the correct region to be extracted before you run this command. "n" and "c" stand for "normal strand" and "complementary strand", respectively. For an amino acid sequence, always use "n".
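As an illustration of what extracting a region from either strand means, here is a small Python sketch. It is not the getseq2 code: the function name get_region is hypothetical, positions are assumed to be 1-based and inclusive, and "c" is assumed to return the reverse complement of the region (DNA only).

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def get_region(seq, start, end, strand="n"):
    """Return bases start..end (1-based, inclusive); strand 'c' gives the reverse complement."""
    if not 1 <= start <= end <= len(seq):
        raise ValueError("region lies outside the sequence")
    region = seq[start - 1:end]
    if strand == "c":
        # Complement every base, then reverse to read 5' to 3' on the other strand.
        region = region.translate(COMPLEMENT)[::-1]
    return region


print(get_region("ATGCCGTA", 2, 5))       # normal strand
print(get_region("ATGCCGTA", 2, 5, "c"))  # complementary strand
```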
extcds
This command extracts all CDS from a GenBank or EMBL file.
extcds
When a gene name is given, this command extracts a single CDS from a GenBank or EMBL file. If you use a gene name that is not present in the database file, an error occurs. This is important, because the gene name field of the database is quite often incorrect.
cdsnuc
This command extracts all the nucleic acid sequences corresponding to the CDS from a GenBank or EMBL file. You can specify the start point and end point of extraction with reference to the start and end of the CDS. Use "S" or "E" to choose the start or end of the CDS, respectively. Lower case characters, i.e., "s" or "e", are also allowed. You should also specify the position relative to the start or end of the CDS by either a positive or a negative integer. If you want to extract the CDS from the 20th base before the initiation codon until the 10th base downstream of the termination codon, type
cdsnuc <infile> <outfile> S -20 E 10
Always use a space to separate different arguments in the command line.
If you want to extract a sequence covering the 25 bases upstream of the initiation codon, type
cdsnuc <infile> <outfile> S -25 S -1
The entire nucleic acid sequence corresponding to the CDS will be extracted by the command:
cdsnuc <infile> <outfile> S 0 E 0
If you do not need the stop codon, type:
cdsnuc <infile> <outfile> S 0 E -3
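cdsnuc <infile> <outfile> S -20 E 10
with a gene name added to the command line will extract only the single sequence specified by that gene name (see the help menu of siseq for the exact argument order). If you use a gene name that is not present in the database file, an error occurs. This is important, because the gene name field of the database is quite often incorrect.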
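The S/E offset convention used in the examples above can be summarized in a short Python sketch. This is only an interpretation inferred from those examples (offset 0 is the first or last base of the CDS, and both ends are inclusive); the function names resolve and cds_window are hypothetical, and only a CDS on the normal strand without introns is considered.

```python
def resolve(anchor, offset, cds_start, cds_end):
    """Map an anchor ('S' or 'E', either case) plus an offset to a 0-based index.

    cds_start and cds_end are the 0-based indices of the first and the last
    base of the CDS.
    """
    base = {"s": cds_start, "e": cds_end}[anchor.lower()]
    return base + offset


def cds_window(seq, cds_start, cds_end, a1, off1, a2, off2):
    """Return the bases between the two resolved positions, both inclusive."""
    first = resolve(a1, off1, cds_start, cds_end)
    last = resolve(a2, off2, cds_start, cds_end)
    if first < 0 or last >= len(seq) or first > last:
        raise ValueError("requested window lies outside the sequence")
    return seq[first:last + 1]


# Toy sequence: 30 flanking bases, a 12-base CDS (ATG ... TAA), 15 flanking bases.
genome = "n" * 30 + "ATGGCCAAATAA" + "n" * 15
start, end = 30, 41

print(cds_window(genome, start, end, "S", 0, "E", 0))     # whole CDS
print(cds_window(genome, start, end, "S", 0, "E", -3))    # CDS without the stop codon
print(cds_window(genome, start, end, "S", -25, "S", -1))  # 25 bases upstream
print(cds_window(genome, start, end, "s", -20, "e", 10))  # CDS with both flanks
```

With this reading, "S -25 S -1" yields exactly 25 bases and "S 0 E -3" drops exactly the three bases of the stop codon, which agrees with the descriptions given for those commands.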
<< IMPORTANT NOTICE ABOUT THE USE OF CDSNUC>>
(1) Variable SEQ_IMPORT
There are some cases in which various exon sequences are registered as separate database entries. SISEQ can extract the exon sequences from these files and combine them into a single sequence. This is now enabled by default, but it might cause a problem with binaries compiled with some versions of compilers. In that case, the sequence import functionality can be disabled by setting the environmental variable "SEQ_IMPORT" to false in UNIX. Alternatively, use a siseq script "siseq.cf" such as:
***the following two lines are contents of the siseq.cf file ***********
setvar seq_import false
cdsnuc infile outfile s 0 e 0
*********** end of siseq.cf ********************************************
The siseq script can be used on all platforms. It is read and processed by SISEQ when the program is run without arguments (i.e., by just double-clicking on the icon in Mac or Win).
Another way of setting the environmental variable is to run "siseq" without arguments and type "setvar seq_import false" in the command line. A short message confirming that the variable has been set appears. Then type the command you want to execute, such as "cdsnuc infile outfile s 0 e 0".
(2) Variable ADDSEQG
If you prefer to extract a genomic sequence including introns, set the environmental variable ADDSEQG to true. This unfamiliar variable name comes from a subroutine called "addseq" that extracts genomic sequences. This functionality is disabled by default. To set the variable, you may use one of the three methods described above for "SEQ_IMPORT": <1> write a siseq script containing "setvar addseqg true". <2> type "setvar addseqg true" in the command line. <3> set the environmental variable ADDSEQG to true on a UNIX platform.
extrna
extrna
This command extracts all RNA gene sequences from a GenBank or EMBL file.
extrna extrna
This command extracts one kind of RNA gene sequences. If there are several sequences that bear a single name such as "trnS", all the RNA gene sequences of the same name will be extracted. Be sure to use the correct gene name, since "tRNA-Ser" is also used in some database entries.
In this case, ADDSEQG can be set as described for "cdsnuc". SEQ_IMPORT has no effect for "extrna" because there is no such case in the database as far as the author knows.
getclu
This command extracts a part of an alignment from an input file in the Clustal format.
toprot
This command includes both the "toprot1" and "toprot6" commands of the previous version, and is further extended to use an external codon table. The switch is either 1, 6, or c. "1" indicates single-frame translation. "6" indicates six-frame translation. In both cases, the output is a FASTA file. The switch "c" gives a composite output including the nucleic acid sequence and its 6-frame translation. If neither a switch nor a codon table is given, this program outputs a 6-frame translation using the standard (universal) codon table. The switch must be specified if a codon table is used. If the codon table is not specified, the default standard codon table (internal codon table) will be used.
tofast
Converts GenBank, EMBL or FASTA file to a multiple FASTA file. If FASTA file is used as an input file, comment lines beginning with a semicolon as well as non-sequence characters such as asterisks and sharps are removed.
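The clean-up applied to a FASTA input file can be sketched in Python as follows. This is an illustration only, not the tofast code: the function name clean_fasta is hypothetical, header lines are assumed to start with ">", and only the two characters named above (asterisk and sharp) are stripped.

```python
def clean_fasta(lines):
    """Yield FASTA lines without ';' comment lines and without '*' or '#' in sequence lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith(";"):
            continue  # comment line: dropped
        if line.startswith(">"):
            yield line  # header lines are kept as they are
        else:
            cleaned = line.replace("*", "").replace("#", "")
            if cleaned:  # skip lines that become empty after cleaning
                yield cleaned


raw = [">seq1 test", "; a comment", "MKV#LA*", "*"]
for line in clean_fasta(raw):
    print(line)
```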
tofast c
Converts a nucleic acid sequence to a complementary sequence.
tofast
Converts nucleic acid sequences to DNA (d) or RNA (r) sequences of the normal (n) or complementary (c) strand. "n" can be omitted.
tofast p
Converts nucleic acid sequences to protein sequences by translating in the specified frame. Frames in the complementary strand can be set with a "minus", i.e., -1, -2 or -3. If the codon table is omitted, the default codon table (universal codon table) will be used, while "mt" specifies a mitochondrial codon table. You can also specify the name of your own codon table here. To edit the table, see the codon table included in the package ("codontable.uni").
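Frame-wise translation can be illustrated with the following Python sketch. It is not the SISEQ implementation: the names translate and six_frames are hypothetical, the built-in table is the standard (universal) genetic code, and a negative frame is assumed to be read from the 5' end of the reverse complement. The package may number complementary frames differently.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq, frame=1):
    """Translate in frame 1, 2, 3 (given strand) or -1, -2, -3 (complementary strand)."""
    if frame not in (1, 2, 3, -1, -2, -3):
        raise ValueError("frame must be 1, 2, 3, -1, -2 or -3")
    dna = seq.upper().replace("U", "T")  # accept RNA input as well
    if frame < 0:
        dna = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    start = abs(frame) - 1
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    # Codons containing ambiguous bases are shown as X.
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(seq):
    """Return a dictionary mapping each of the six frames to its translation."""
    return {frame: translate(seq, frame) for frame in (1, 2, 3, -1, -2, -3)}


for frame, protein in six_frames("ATGGCCTAA").items():
    print(frame, protein)
```

A user-supplied codon table would simply replace CODON_TABLE with another dictionary of the same shape.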
getent
Extracts database entries according to the given keywords. The format for the keywords is: AC=xxxxxx ID=xxxxxx DE=xxxxx OS=xxxxxxxxx. Up to 10 keywords can be used. They should be listed in the command line with a space character as a separator. Any database entry that matches one of the keywords is copied to outfile, i.e., the keywords are combined by an OR operator. This is compatible with both EMBL and GenBank databases.
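The OR combination of keywords can be sketched in Python as follows. This is a simplified illustration, not the getent code: the function names are hypothetical, only EMBL-style entries (two-letter line codes, entries closed by a "//" line) are handled, and a keyword is assumed to match when its value occurs, ignoring case, anywhere in a line with the corresponding code.

```python
def split_entries(text):
    """Split a catenated EMBL flat file into entries, each ending with a '//' line."""
    entry = []
    for line in text.splitlines():
        entry.append(line)
        if line.strip() == "//":
            yield "\n".join(entry) + "\n"
            entry = []


def matches(entry, keywords):
    """True if at least one KEY=value keyword is found in a line starting with KEY."""
    for keyword in keywords:
        code, _, value = keyword.partition("=")
        if not value:
            continue  # ignore malformed keywords such as 'AC=' or 'AC'
        for line in entry.splitlines():
            if line.startswith(code.upper()) and value.lower() in line[2:].lower():
                return True
    return False


def get_entries(text, keywords):
    """Return the entries that match any of the keywords (OR operator)."""
    if len(keywords) > 10:
        raise ValueError("at most 10 keywords are supported")
    return [e for e in split_entries(text) if matches(e, keywords)]


# Two made-up entries, reduced to the lines that matter here.
database = (
    "ID   ENTRY1\nAC   X00001;\nDE   psbA gene\nOS   Spinacia oleracea\n//\n"
    "ID   ENTRY2\nAC   X00002;\nDE   rbcL gene\nOS   Nicotiana tabacum\n//\n"
)
print("".join(get_entries(database, ["AC=X00002"])))
```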
genlist (optional)
Lists the names of genes in the database file together with the database IDs. Although all other SISEQ commands need an output file, this command can be used without one. In this case, the output is shown on the console. This is useful in practice for confirming the gene name before using the command "extcds", "cdsnuc", or "extrna" to extract a single gene.
seqcat
This command catenates the sequences in the two input files. The input files may be either single sequence files or clustal files. In the latter case, the output will be a catenated clustal file.
extint
extint
This command extracts intron sequences from a GenBank or EMBL file according to the annotation. The output is a multiple FASTA file. The start and end points should be indicated according to the rule described for cdsnuc or extrna, except that the start and end refer to the start and end points of introns.
noncod <1/0>(optional)
This command extracts non-coding sequences from a GenBank or EMBL file according to the annotation. The output is a multiple FASTA file. The option switch specifies the inclusion of introns in the output. If the switch is 1, introns are included.
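The effect of the intron switch can be pictured as an interval calculation. The Python sketch below is an assumption about the logic rather than the noncod code itself: the function name noncoding_intervals is hypothetical, coordinates are 1-based and inclusive, strands are ignored, and each gene is given as a list of annotated exon intervals.

```python
def noncoding_intervals(length, genes, include_introns=False):
    """Return 1-based inclusive (start, end) intervals not covered by coding regions.

    genes is a list of genes; each gene is a list of (start, end) exon intervals.
    """
    covered = []
    for exons in genes:
        if include_introns:
            covered.extend(exons)  # introns stay uncovered, so they are reported
        else:
            # Mask the whole gene span, so that introns are not reported.
            covered.append((min(s for s, _ in exons), max(e for _, e in exons)))
    gaps, position = [], 1
    for start, end in sorted(covered):
        if start > position:
            gaps.append((position, start - 1))
        position = max(position, end + 1)
    if position <= length:
        gaps.append((position, length))
    return gaps


# A 100-base sequence with a two-exon gene and a single-exon gene.
genes = [[(11, 20), (31, 40)], [(61, 80)]]
print(noncoding_intervals(100, genes))                        # switch 0
print(noncoding_intervals(100, genes, include_introns=True))  # switch 1
```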
chname
This command is used to change interactively the name of the sequence (sequence identifiers) within a sequence file.
chname
If both the old name and the new name are specified, the sequence identifier is changed silently. This might not be easy, because the correct old name must be used. If the input file is a single sequence file, the old name can be replaced by a hyphen. The output is a FASTA file, except if the input file is a clustal file.
simtbl
This command writes a table of similarity scores. The input should be a clustal file. This is experimental and is under development.
sites, nucaln, splcod
These are experimental commands. Consult 'help' menus of siseq.
**********************************************
COMMAND LINE MODE
**********************************************
On a UNIX workstation, type a command and options in a single line, i.e.,
siseq tofast infile outfile p 1
If you have installed individual commands, type
tofast infile outfile p 1
Alternatively, type "siseq" and then type commands after the menu has appeared.
On Macintosh and Windows OS, double click on the icon of the program and then,
type commands after the menu has appeared. You can type in a command with all
necessary options in a single line. Alternatively, you can type a single command name, and then, follow the instructions (INTERACTIVE COMMAND LINE MODE).
**********************************************
SCRIPT MODE (siseq script)
**********************************************
SISEQ now uses a script called "siseq.cf", if invoked without argument and the script file is present in the current directory. The script file is easy to edit. Write a single command per line. SISEQ executes the commands line by line. In this mode, all the siseq commands as well as additional commands are available. These include "copy", "remove", "fcat", "setvar", and "system".
copy: copies a file to another file
remove: removes files in the argument list
fcat: add contents of a file to the end of another file
setvar: sets environmental variables used in SISEQ
form circular: force DNA form to circular
form default: (default is automatic, i.e., depends on the word
'circular' in the ID line)
printline xx: set length of line to xx characters
printline default: (default is 75)
addseqg true/false: (see cdsnuc)
seq_import true/false: (see cdsnuc)
system: calls system command (UNIX only)
Use of a siseq script enables an automatic extraction and modification of multiple sequence files. This is true for all platforms including Macintosh and Windows. See sample scripts.
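Because a siseq script is a plain text file with one command per line, it can be generated by another program when many files have to be processed. The Python sketch below writes such a script using only the "setvar" and "cdsnuc" lines shown earlier in this manual. The helper name write_siseq_script, the ".cds" output naming and the input file names are hypothetical.

```python
import tempfile
from pathlib import Path


def write_siseq_script(infiles, directory=".", seq_import=False):
    """Write siseq.cf with one cdsnuc command per input file and return its text."""
    lines = ["setvar seq_import " + ("true" if seq_import else "false")]
    for name in infiles:
        outfile = Path(name).stem + ".cds"  # hypothetical output naming scheme
        lines.append(f"cdsnuc {name} {outfile} s 0 e 0")
    text = "\n".join(lines) + "\n"
    Path(directory, "siseq.cf").write_text(text)
    return text


# Demonstration in a temporary folder; in practice, use the folder holding the input files.
with tempfile.TemporaryDirectory() as folder:
    print(write_siseq_script(["chl1.dat", "chl2.dat"], folder))
```

Running SISEQ without arguments in the folder that contains the generated siseq.cf should then execute the commands line by line, as described above.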
******************************
GRAPHICAL USER INTERFACE MODE
******************************
A Tcl/Tk script called "siseq.tk" is a graphical interface for the SISEQ program. It is slow and does not support all of the functions of SISEQ, so use command line input for advanced usage. Nevertheless, the Tk graphical interface gives a user a general idea of the SISEQ package.
On UNIX platforms that have Tcl/Tk installed, just type "siseq.tk" to invoke the SISEQ graphical interface. On Windows, you might need to set the paths. To do this, the use of a batch file is recommended. A sample batch file is provided with the Windows version. Before using the batch file, edit the file to set the correct name of the "Wish" program and the location of "Wish" program.
Use of a GUI in Power Macintosh
1. Read the file README.tk. You need a Tcl/Tk package to use a GUI for SISEQ.
2. Put all necessary files into a single folder. You need "Wish8.1", "Tclapplescript2.0.shlb","siseq.tk" and "siseqFAT", as well as input files. The "Tclapplescript2.0.shlb" file should be located in the System folder after normal installation, but this file might not be recognized by the "Wish" program (at least on my PowerMac). If you use older versions of Wish (8.0 or 4.2), the name of Tclapplescript shared library might be different.
3. Start the "Wish" program by double clicking on the icon of the "Wish" program.
4. Type "source siseq.tk" in the command window of the "Wish" program.
A graphical interface of SISEQ will appear.
5. The "siseq.tk" script uses an AppleScript extension of Tcl, which calls "siseqFAT" without arguments.
The command name and other options are saved in the file "siseq.cf". The "siseqFAT" program reads this file and executes the commands just as in the script mode (see above). If a file called "siseq.cf" is already present, it is overwritten by the "siseq.tk" program.
****************************
Contents of the package
****************************
The binary for the Macintosh (FAT binaries that will function on both 68k and PPC).
Manuals
About Siseq Tools(J) (MacWrite II in Japanese)
About Siseq Tools(E).doc (Microsoft Word 6/95 in English)
README.120
README.old
The binaries have been compiled with CodeWarrior Professional (academic version). Therefore, the binaries should not be used for commercial purposes.
------------ end of file: About Siseq Tools ------------