You've gone out and done the research and found a bioinformatics software package you want to install on your own computer. Now what do you do?
When you look for Unix software on the Web, you will find that it's distributed in a number of different formats. Each type of software distribution requires a different type of handling. Some are very simple to install, almost like installing software on a Mac or PC. On the other hand, some software is distributed in a rudimentary form that requires your active intervention to get it running. In order to get this software working, you may have to compile it by hand or even modify the directions that are sent to the compiler so that the program will work on your system. Compiling is the process of converting software from its human-readable form, source code, to a machine-readable executable form. A compiler is the program that performs this conversion.
Software that's difficult to install isn't necessarily bad software. It may be high-quality software from a research group that doesn't have the resources to produce an easy-to-use installation kit. While this is becoming less common, it's still common enough that you will need to know some things about compiling software.
3.3.1 Unix tar Archives
Software is often distributed as a tar archive, which is short for "tape archive." We discuss tar and other file-compression options in more detail in Chapter 5. Not coincidentally, these archives are one of the most common ways to distribute Unix software on the Internet. tar allows you to download one file that contains the complete image of the developer's working software installation and unpack it right back into the correct subdirectories. If tar is used with the p option, file permissions can even be preserved. This ensures that, if the developer has done a competent job of packing all the required files in the tar archive, you can compile the software relatively easily.
tar archives are often compressed further using either the Unix compress command (indicated by a .tar.Z extension) or with gzip (indicated by a .tar.gz or .tgz extension).
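As a rough illustration of how such an archive is built and unpacked, the sketch below packs a small directory tree, lists its contents, and extracts it again with permissions preserved. The directory name demo-pkg and the files in it are invented for the example, and the sketch assumes a tar and gzip that accept the options shown.

```shell
# Work in a scratch directory so nothing on the system is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a tiny stand-in for a developer's source tree (hypothetical names).
mkdir -p demo-pkg/src demo-pkg/doc
echo 'int main(void) { return 0; }' > demo-pkg/src/main.c
echo 'Read me first.' > demo-pkg/doc/README
chmod 640 demo-pkg/doc/README

# Pack the whole tree into one file, then compress it with gzip.
tar cf demo-pkg.tar demo-pkg
gzip demo-pkg.tar                 # produces demo-pkg.tar.gz

# On the receiving side: uncompress, list, then extract.
mkdir unpacked
cd unpacked
gunzip -c ../demo-pkg.tar.gz > demo-pkg.tar
tar tf demo-pkg.tar               # t = table of contents; nothing is written
tar xpf demo-pkg.tar              # x = extract, p = preserve permissions

# The subdirectories come back exactly as they were packed.
ls -l demo-pkg/doc/README
```

An archive with a .tar.Z extension would be handled the same way, with uncompress in place of gunzip.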
3.3.2 Binary Distributions
Software can be distributed either as uncompiled source code or binaries. If you have a choice, and if you don't know any reason to do otherwise, choose the binary distribution. It will probably save you a lot of headaches.
Binary software distributions are precompiled and (at least in theory) ready to run on your machine. When you download software that is distributed in binary form, you will have a number of options to choose from. For example, the following listing is the contents of the public FTP site for the BLAST sequence alignment software. There are several archives available, each for a different operating system; if you're going to run the software on a Linux workstation, download the file blast.linux.tar.Z.
README.bls                 52 Kb     Wed Jan 26 18:45:00 2000
blast.alphaOSF1.tar.Z      12756 Kb  Wed Jan 26 18:40:00 2000  Unix Tape Archive
blast.hpux11.tar.Z         11964 Kb  Wed Jan 26 18:43:00 2000  Unix Tape Archive
blast.linux.tar.Z          9334 Kb   Wed Jan 26 18:41:00 2000  Unix Tape Archive
blast.sgi.tar.Z            14746 Kb  Wed Jan 26 18:44:00 2000  Unix Tape Archive
blast.solaris.tar.Z        12724 Kb  Wed Jan 26 18:37:00 2000  Unix Tape Archive
blast.solarisintel.tar.Z   10679 Kb  Wed Jan 26 18:43:00 2000  Unix Tape Archive
blastz.exe                 3399 Kb   Wed Jan 26 18:44:00 2000  Binary Executable
Here are the basic binary installation steps:
1. Download the correct binaries. Be sure to use binary mode when you download. Download and read the instructions (usually a README or INSTALL file).
2. Follow the instructions.
3. Make a new directory and move the archive into it, if necessary.
4. Use uncompress (*.Z) or gunzip (*.gz) to uncompress the file.
5. Use tar tf to examine the contents of the archive and tar xvf to extract it.
6. Run configuration and installation scripts, if present.
7. Link the binary into a directory in your default path using ln -s, if necessary.
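The steps above might look something like the following session. To keep the sketch runnable without a network connection, the download is simulated: a small archive is fabricated on the spot, with a shell script standing in for the precompiled binary. The names mytool and mytool.linux.tar.gz are hypothetical, and a scratch directory is used where a real installation would more likely use /usr/local.

```shell
# Scratch area for the whole exercise.
top=$(mktemp -d)
mkdir -p "$top/download" "$top/bin" "$top/stage"

# Stand-in for step 1: fabricate the archive you would normally download.
# (mytool is hypothetical; a shell script plays the part of the binary.)
printf '#!/bin/sh\necho "mytool ran"\n' > "$top/stage/mytool"
chmod 755 "$top/stage/mytool"
echo 'Installation notes go here.' > "$top/stage/README"
(cd "$top/stage" && tar cf - mytool README) | gzip > "$top/download/mytool.linux.tar.gz"

# Step 3: make a directory for the package and move the archive into it.
mkdir "$top/mytool-dist"
mv "$top/download/mytool.linux.tar.gz" "$top/mytool-dist/"
cd "$top/mytool-dist"

# Step 4: uncompress (gunzip replaces the .gz file with a plain .tar file).
gunzip mytool.linux.tar.gz

# Step 5: look before you unpack, then extract.
tar tf mytool.linux.tar
tar xvf mytool.linux.tar

# Step 7: link the executable into a directory on the search path.
ln -s "$top/mytool-dist/mytool" "$top/bin/mytool"
PATH="$top/bin:$PATH"
mytool
```

Listing the archive with tar tf first is a cheap way to find out whether it unpacks into its own subdirectory or scatters files into the current one.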
3.3.3 RPM Archives
RPM archives are a new kind of Unix software distribution that has recently become popular. These archives can be unpacked using the command rpm. The Red Hat Package Manager program is included in Red Hat Linux distributions and is automatically installed on your machine when you install Linux. It can also be downloaded freely from http://www.rpm.org and used on any Linux or other Unix system. rpm creates a software database on your machine and simplifies installations and updates, and even allows you to create RPM archives. RPM archives come in either source or binary form, but aside from the question of selecting the right binary, the installation is equally simple either way.
(As we introduce commands, we'll show you the format of the command line for each command, for example, "Usage: man name", and describe the effects of some options we find most useful.)
Usage: rpm --[options] *.rpm
Here are the important rpm options:
rebuild
Builds a package from a source RPM
install
Installs a new package from a binary RPM
upgrade
Upgrades existing software
uninstall (or erase)
Removes an installed package
query
Checks to see if a package is installed
verify
Checks information about installed files in a package
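The options above translate into command lines roughly as follows. This is only a sketch: the package name and filenames are hypothetical, and the commands that would change the system are printed rather than executed. The erase spelling is used for removal, and on more recent rpm releases rebuilding from a source RPM is handled by a separate rpmbuild command, so check the rpm manpage on your own system.

```shell
# Hypothetical package name and files, used only for illustration.
pkg_name="example-tool"
pkg_file="example-tool-1.0-1.i386.rpm"
src_file="example-tool-1.0-1.src.rpm"

# Commands that change the system are printed, not executed.
# Remove the leading "echo" (and become the superuser) to run them for real.
echo rpm --install "$pkg_file"      # install a new package from a binary RPM
echo rpm --upgrade "$pkg_file"      # upgrade software that is already installed
echo rpm --erase "$pkg_name"        # remove an installed package
echo rpm --rebuild "$src_file"      # build a package from a source RPM
echo rpm --verify "$pkg_name"       # check the installed files of a package

# A query is harmless, so run one if rpm is present on this machine.
if command -v rpm >/dev/null 2>&1; then
    if rpm --query "$pkg_name" >/dev/null 2>&1; then
        status="installed"
    else
        status="not installed"
    fi
else
    status="rpm not available"
fi
echo "$pkg_name: $status"
```

Note that install and upgrade take the name of an RPM file, while erase, query, and verify take the name of a package already recorded in the rpm database.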
3.3.3.1 GnoRPM
Recent versions of Linux that include the GNOME user interface also include an interactive installation tool called GnoRPM. It can be accessed from the System folder in the main GNOME menu. To install software from a CD-ROM with GnoRPM, simply insert and mount the CD-ROM, click the Install button in GnoRPM, and GnoRPM provides a selectable list of every package on the CD-ROM you haven't already installed. You can also uninstall and update packages with GnoRPM, ensuring that the entire package is cleanly removed from your system. GnoRPM informs you if there are package dependencies that require you to download code libraries or other software before completing the installation.
3.3.4 Source Distributions
Sometimes the correct binary isn't available for your system, there's no RPM archive, and you have no choice but to install from source code.
Source distributions can be easy or hard to install. The easy ones come with a configuration script, an install script, and a Makefile for your operating system that holds the instructions to the compiler.
An example of an easy-to-install package is the LessTif source code distribution. LessTif is an open source version of the OSF/Motif user interface toolkit. Motif was developed for high-end workstations and costs a few thousand dollars a year to license; LessTif supports many Motif applications (such as the multiple sequence alignment package ClustalX and the useful 2D plotting package Grace) for free. When the LessTif distribution is unpacked, it looks like:
AUTHORS        KNOWN_BUGS      acconfig.h     configure     ltmain.sh
BUG-REPORTING  Makefile        acinclude.m4   configure.in  make.out
COPYING        Makefile.am     aclocal.m4     doc           missing
COPYING.LIB    Makefile.in     clients        etc           mkinstalldirs
CREDITS        NEWS            config.cache   include       scripts
CURRENT_NOTES  NOTES           config.guess   install-sh    test
CVSMake        README          config.log     lib           test_build
ChangeLog      RELEASE-POLICY  config.status  libtool
INSTALL        TODO            config.sub     ltconfig
Configuration and installation of LessTif on a Linux workstation is a practically foolproof process. As the superuser, move the source tar archive to the /usr/local/src directory. Uncompress and extract the archive. Inside the directory that is created (lesstif or lesstif.0-89, for example), enter ./configure. The configuration script will take a while to run; when it's done, enter make. Compilation will take several minutes; at the end, edit the file /etc/ld.so.conf. Add the line /usr/lesstif/lib, save the file, and then run ldconfig -v to make the shared LessTif libraries available on your machine.
Complex software such as LessTif is assembled from many different source code modules. The Makefile tells the compiler how to put them together into one large executable. Other programs are simple: they have only one source code file and no Makefile, and they are compiled with a one-line directive to the compiler. You should be able to tell which compiler to use by the extension on the program filename. C programs are often labeled *.c, FORTRAN programs *.f, etc.
To compile a C program, enter gcc program.c -o program; for a FORTRAN program, the command is g77 program.f -o program. The manpages for the compilers, or the program's documentation (if there is any), should give you the form and possible arguments of the compiler command.
Compilers convert human-readable source code into machine-readable binaries. Each programming language has its own compilers and compiler instructions. Some compilers are free, others are commercial. The compilers you will encounter on Linux systems are gcc, the GNU Project C and C++ compiler, and g77, the GNU Project FORTRAN compiler.[1] In computational biology and bioinformatics, you are likely to encounter programs written in C, C++, FORTRAN, Perl, and Java. Use of other languages is relatively rare. Compilers or interpreters for all these languages are available in open source distributions.
[1] The GNU project is a collaborative project of the Free Software Foundation to develop a completely open source Unix-like operating system. Linux systems are, formally, GNU/Linux systems as they can be distributed under the terms of the GNU General Public License (GPL), the license developed by the GNU project.
Difficult-to-install programs come in many forms. One of the main problems you may encounter will be source code with dependencies on code libraries that aren't already installed on your machine. Be sure to check the documentation or the README file that comes with the software to determine whether additional code or libraries are required for the program to run properly.
An example of an undeniably useful program that is somewhat difficult to install is ClustalX, the X windows interface to the multiple sequence alignment program ClustalW. In order to install ClustalX successfully on a Linux workstation, you first need to install the NCBI Toolkit and its included Vibrant libraries. In order to create the Vibrant libraries, you need to install the LessTif libraries and to have XFree86 development libraries installed on your computer.
Here are the basic steps for installing any package from source code:
1. Download the source code distribution. Use binary mode; compressed text files are encoded.
2. Download and read the instructions (usually a README or INSTALL file; sometimes you have to find it after you extract the archive).
3. Make a new directory and move the archive into it, if necessary.
4. uncompress (*.Z) or gunzip (*.gz) the file.
5. Extract the archive using tar xvf or as instructed.
6. Follow the instructions (did we say that already?).
7. Run the configuration script, if present.
8. Run make if a Makefile is present.
9. If a Makefile isn't present and all you see are *.f or *.c files, use gcc or g77 to compile them, as discussed earlier.
10. Run the installation script, if present.
11. Link the newly created binary executable into one of the binary-containing directories in your path using ln -s (this is usually part of the previous step, but if there is no installation script, you may need to create the link by hand).
3.3.5 Perl Scripts
The Perl language is used to develop web applications and is frequently used by computational biologists. Perl programs (called scripts) have the extension *.pl (or *.cgi if they are web applications). Perl is an interpreted language; in other words, Perl programs don't have to be compiled in order to run. Instead, each command in a Perl script is sent to a program called the Perl interpreter, which executes the commands.[2]
[2] There is now a Perl compiler, which can optionally be used to create binary executables from Perl scripts. This can speed up execution.
To run Perl programs, you need to have the Perl interpreter installed on your machine. Most Linux distributions contain and automatically install Perl. The most recent version of Perl can always be obtained from http://www.perl.com, along with plenty of helpful information about how to use Perl in your own work. We discuss some of the basic elements of Perl in Chapter 12.
3.3.6 Putting It in Your Path
When you give a command, the default path or lookup path is where the system expects to find the program (which is also known as the executable). To make life easier, you can link the binary executable created when you compile a program to a directory like /usr/local/bin, rather than typing the full pathname to the program every time you run it. If you're linking across filesystems, use the command ln -s (which we cover in Chapter 4) to link the command to a directory of executable files. Sometimes this results in the error "too many levels of symbolic links" when you try to run the program. In that case, you have to access the executable directly or use mv or cp to move the actual executable file into the default path. If you do this, be sure to also move any support files the program needs, or create a link to them in the directory in which the program is being run.
Some software distributions automatically install their executables in an appropriate location. The command that usually does this is make install. Be sure to run this command after the program is compiled. For more information on symbolic linking, refer to one of the Unix references listed in the Bibliography, or consult your system administrator.
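The following sketch shows how the path lookup and a symbolic link work together. The program name seqtool is hypothetical, a shell script stands in for the compiled executable, and a scratch directory takes the place of /usr/local/bin so that the example can be tried without superuser privileges.

```shell
top=$(mktemp -d)
mkdir -p "$top/build" "$top/bin"

# Stand-in for a freshly compiled executable (hypothetical program name).
printf '#!/bin/sh\necho "seqtool running"\n' > "$top/build/seqtool"
chmod 755 "$top/build/seqtool"

# Without a link, the full pathname is needed every time.
"$top/build/seqtool"

# Link it into a directory of executables, and put that directory in the path.
ln -s "$top/build/seqtool" "$top/bin/seqtool"
PATH="$top/bin:$PATH"
export PATH

# Now the bare name is enough; command -v shows where the shell found it.
command -v seqtool
seqtool

# If a symbolic link misbehaves, copy the real file into place instead.
rm "$top/bin/seqtool"
cp "$top/build/seqtool" "$top/bin/seqtool"
seqtool
```

The change to PATH made here lasts only for the current shell session; to make it permanent, it would have to go in your shell's startup file.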
3.3.7 Sharing Software Among Multiple Users
Before you start installing software on a Unix system, one of the first things to do is to find out where shared software and data are stored on your machines. It's customary to install local applications in /usr/local, with executable files in /usr/local/bin. If /usr/local is set up as a separate partition on the system, it then becomes possible to upgrade the operating system without overwriting local software installations.
Maintaining a set of shared software is a good idea for any research group. Installation of a single standard version of a program or software package by the system administrator ensures that every group member will be using software that works in exactly the same way. This makes troubleshooting much easier and keeps results consistent. If one user has a problem running a version of a program that is used by everyone in the group, the troubleshooting focus can fall entirely on the user's input, without muddying the issue by trying to figure out whether a local version of the program was compiled correctly.
For the most part, it's unnecessary for each user of a program to have her own copy of that program residing in a personal directory. The main exception to this is if a user is actually modifying a program for her own use. Such modifications should not be applied to the public, standard version of the program until they have been thoroughly tested, and therefore the user who is modifying the program needs her own version of the program source and executable.