***********************************
Simple Sequence Manipulation Tools
(SISEQ)
***********************************

Copyright Notice

The copyright of the SISEQ package is owned by the author, Naoki Sato. This package may be distributed freely if it is accompanied by this manual and the other documents. This package was developed for academic purposes. Any commercial use requires the permission of the author. Users are responsible for any consequences of the use of this package. The author is not responsible for any hardware trouble, damage to files, or loss of data that might be caused by the use of this package.

Author: Naoki Sato
Department of Molecular Biology, Faculty of Science, Saitama University,
255 Shimo-Ohkubo, Urawa 338-8570 Japan
E-mail: naokisat@molbiol.saitama-u.ac.jp
Home page: http://www.molbiol.saitama-u.ac.jp/~naoki/

Copyright 1998-2000 Naoki Sato

Version 1.23 (partially japanized version) March 5, 2000.

*****************************
Objectives
*****************************

This package has been developed to extract sequence information from large database entries and to manipulate sequence information for input to various sequence analysis software. In our experience, commercial GUI-based programs such as Genetyx and DNASIS are not suited to these purposes, because they need a large amount of memory and a large number of mouse clicks and, consequently, a long operating time. In addition, the SISEQ package is intended to process files containing multiple sequences, either in the multiple FASTA format (as used for the input files of multiple sequence alignment) or as concatenated database entries such as the GenBank and EMBL database releases. The programs in this package do not need a graphical user interface (although one is optionally available) and do not use a large amount of memory to process common sequence files, yet they perform extraction and manipulation of sequence data rapidly.
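As an illustration of the multiple FASTA format mentioned above, the following Python sketch reads such a file one entry at a time, so that only a single sequence is held in memory at any moment. It is not part of the SISEQ package and does not show how the SISEQ commands are implemented; the function name iter_fasta and the sample data are hypothetical and chosen only for this example.

```python
import io
from typing import Iterator, TextIO, Tuple


def iter_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from a multiple FASTA stream.

    Hypothetical helper for illustration only; not a SISEQ command.
    Only the entry currently being read is kept in memory.
    """
    header = None   # header text of the entry being read (without ">")
    chunks = []     # sequence lines collected for that entry
    for line in handle:
        line = line.rstrip("\r\n")  # tolerate Unix, DOS and stray CR endings
        if line.startswith(">"):
            # A new header closes the previous entry, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:].strip()
            chunks = []
        elif header is not None and line.strip():
            # Sequence line: skip blank lines, ignore text before any header.
            chunks.append(line.strip())
    # Emit the last entry in the stream.
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    # Made-up sample data standing in for a real multiple FASTA file.
    sample = io.StringIO(">seq1 first example\nATGC\nGGTA\n>seq2\nTTAA\n")
    for name, seq in iter_fasta(sample):
        print(name, len(seq))
```

To try the sketch on a real file, pass an opened text file in place of the sample, for example open("myseqs.fasta"), where the file name is again only an assumption.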
See the README.123 file for the description of the current version 1.2.3.

Notice

1. This software was developed primarily for my own research, but I put it in the public domain in the hope that it will be useful. I did my best to write code with as few bugs as possible, but I cannot guarantee that no command will freeze with a very unusual input file or options on some particular machines. Special attention should be paid to the database file format, which has changed rather rapidly during the past years. In no case will the author of this software be responsible for any kind of hardware trouble, file damage, or loss of data that might occur in the course of the use of this software. Users should use this software at their own risk.

2. If you want to use a database entry that has been downloaded from the network, be sure that the file does not contain extra spaces, extra return codes, etc. These will prevent the software from recognizing the file format. In this case, you will have to remove the extra lines or extra characters with an ordinary text editor such as Edit7 or YouEdit (for the Macintosh). To use a file that has been transferred from a different OS (Unix, Mac, DOS), the return codes must be corrected. Whether this is necessary depends on the software used for the transfer (FTP etc.). To correct the return codes, use the "txtr" command in this software package. Type "txtr cr". On a Macintosh, double click on the txtr icon, and then type " cr".

3. Edit7 is available from http://www.bekkoame.or.jp/~iimori/sw/Edit7.html

4. MEMORY REQUIREMENTS. Although this program does not need an unusually large amount of memory, it requires a fairly large amount of memory to process a large database file. In my experience, about 90 MB of memory was needed to process a GenBank file of about 250 MB, and about 180 MB of memory for a 500 MB file.
In Mac OS, set the memory size of the application to an appropriate value. If necessary, set up virtual memory in the control panel.
In Win OS, use the binary provided with the distribution, which has been tested with a 500 MB file.
In UNIX, prepare enough swap space. Since only a part of the real memory is available to the application, be sure to have a sufficient total of (memory + swap) on your machine.
Note that a database file with a large number of CDS (such as a bacterial genome file) needs a large amount of memory (up to about 30 MB), since all data are held in memory during the processing of a single database entry (i.e., a sequence with a single ID).

5. CPU performance. To process a large database file, a UNIX system is recommended because of its speed. In my experience, the hum1.dat file (500 MB in EMBL format) was processed by the command "cdsnuc hum1.dat outfile s 0 e 0" with the SEQ_IMPORT option within 6 min on a SUN Ultra 10 with 256 MB memory, 200 MB swap and a 333 MHz UltraSparc II CPU. The same operation was completed within about 60 min on a portable Linux box with a 120 MHz Pentium CPU, 40 MB memory and 200 MB swap. It took about 25 min on a Silicon Graphics O2 (R5000, 180 MHz, 128 MB memory, 256 MB swap). On the other hand, it took two and a half hours on a PowerMac 7600/200 with 160 MB memory, and 45 min on Windows 98 with a 350 MHz Pentium II and 144 MB memory. These differences may reflect the use of swap rather than the processing itself, and do not imply anything about differences between operating systems. Nevertheless, a small genomic file, such as that of an organellar or bacterial genome, can be processed within a reasonable time on all platforms.

!!!!!!!!!! IMPORTANT !!!!!!!!!
Read the README.120 file before attempting to read the usage described below. To read a help message for individual commands, run the siseq program, then type h, and follow the instructions.
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

Notes on the usage of Macintosh or Windows version

1. This is not a simple GUI-based program that needs just double clicking and choosing menus.
You should read the following explanation of the commands before trying to use the program. All the SISEQ commands are integrated into a single program, "siseqFAT" (Macintosh) or "siseqW.exe" (Windows).

2. Put the file to be processed in the folder in which the program resides.

3. Double click on the program icon "siseqFAT". A command input window will appear.

4. Type an appropriate command and options, and then hit return. Japanese characters are not allowed in file names. Always use "roman" characters, not "hankaku". To use help, type h or help. In the help menu, type e or exit to return to the main menu.

5. A program log window will appear.

6. When the "Terminated" line appears, the program has terminated correctly. Close the window by pressing the command + Q keys. Alternatively, choose exit from the menu.

7. The output file should be in the same folder. It is a text file with Edit7 as the creator. If you have installed Edit7 on your system, you can open the file with just a double click. If not, you will have to open the file from a text editor such as SimpleText, YouEdit, or another word processor.

************************
Usage of commands
************************

In the following explanation, the placeholders for the input file and the output file should be replaced by the name of the input file and the name of the output file, respectively.

txtr