
=================================
Common Lisp Computational Biology
=================================

The Common Lisp Computational Biology (CLCB) package aims
to be a comprehensive, flexible and easy-to-use library for
bioinformatics and computational biology.  It is meant to match
Bio{Perl,Java,Python,Ruby} in terms of modularity and scope. Because
Lisp intertwines the handling of data and control structures, this
package should be a good basis for active research in biological
sequence analysis: it combines the convenience of readily available
functionality with Lisp's dynamic nature, which makes it easy to work
with new structures.

Scope and Motivation
====================

Common Lisp has the outstanding ability to intertwine control structures
and data.  This is particularly appealing for bioinformatics, which is
data-driven and deals with structures of enormous complexity. The success
in our field of the statistics suite R, which has its roots in Lisp,
comes close to proving that statement. The groundwork presented here
provides coherent access to

 * objects representing biological sequences
 * the public database Ensembl

The Common Lisp Computational Biology package is intended to be a
comprehensive, flexible and easy to use library for bioinformatics and
computational biology.  It (hopefully) will be comparable to
Bio{Perl,Java,Python,Ruby} in terms of modularity and scope.

There are several other biologically inspired initiatives working with
Lisp. Their functionality does not directly overlap with what is
provided by CLCB. Consequently, CLCB is more of an addition than a
competitor to BioLisp/BioBike or to ... ?


Modularity
==========

Molecules
---------

A package that assists in modeling molecular biology should first provide
molecule primitives. Currently the classes molecule and amino-acid are
implemented.  Second, it needs ways to represent combinations of
molecules (sequences) and operations on them (e.g. to calculate the mass
and other chemical properties).  Every biological object below the
organelle level should inherit from the molecule base class.  The
definitions of the different kinds of molecules, plus the fundamental
`molecule' class, are bundled in the folder named molecule.
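
As a rough illustration of the idea above, the following sketch shows a
base class, a derived amino acid class and a generic mass calculation in
plain CLOS.  All names (`sketch-molecule', `sketch-amino-acid',
`total-mass') are hypothetical and chosen for this example only; they are
not taken from the CLCB sources.  The masses are approximate average
masses of the free amino acids, and the sum ignores the water lost when a
peptide bond forms.

```lisp
;; Hypothetical sketch: class, slot and function names are illustrative
;; assumptions, not CLCB's actual API.
(defclass sketch-molecule ()
  ((name :initarg :name :reader molecule-name)
   (mass :initarg :mass :reader molecule-mass
         :documentation "Mass in daltons.")))

;; An amino acid is a molecule with an additional one-letter code.
(defclass sketch-amino-acid (sketch-molecule)
  ((one-letter-code :initarg :one-letter-code
                    :reader one-letter-code)))

(defgeneric total-mass (thing)
  (:documentation "Return the summed mass of THING in daltons."))

;; A single molecule simply reports its own mass.
(defmethod total-mass ((molecule sketch-molecule))
  (molecule-mass molecule))

;; A list of molecules is treated as a naive combination: the masses are
;; added up without any correction for bonds between the members.
(defmethod total-mass ((molecules list))
  (reduce #'+ molecules :key #'total-mass))

;; Approximate average masses of the free amino acids.
(defparameter *sketch-glycine*
  (make-instance 'sketch-amino-acid
                 :name "glycine" :mass 75.07 :one-letter-code #\G))

(defparameter *sketch-alanine*
  (make-instance 'sketch-amino-acid
                 :name "alanine" :mass 89.09 :one-letter-code #\A))
```

Calling `(total-mass (list *sketch-glycine* *sketch-alanine*))' adds the
two masses.  Because the calculation is a generic function, a real
sequence class could provide its own method without changing the callers.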


Biological Sequences
--------------------

In most other Bio packages, biological sequences are represented
by strings.  Only BioJava uses objects instead of characters for
this purpose.  We believe the latter approach to be much more flexible
and extensible than using just strings: it lets us model
characteristics that would otherwise have to be stored in separate
feature tables, e.g. methylation of bases in DNA, unusual nucleotides in
RNA, and post-translational modifications of proteins and single amino
acids.

The base classes for biological sequences are defined in
`seq/bio-sequences.lisp'.
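
To make the difference to string-based sequences concrete, here is a
minimal sketch in which every position of a DNA sequence is an object
that can carry its own modifications.  The names `sketch-nucleotide',
`make-sketch-dna', `sequence-string' and `modified-positions' are
assumptions made for this example; the actual classes in
`seq/bio-sequences.lisp' may look quite different.

```lisp
;; Hypothetical sketch: these names are illustrative and not taken from
;; seq/bio-sequences.lisp.
(defclass sketch-nucleotide ()
  ((base :initarg :base :reader nucleotide-base)
   (modifications :initarg :modifications :initform '()
                  :reader nucleotide-modifications)))

(defun make-sketch-dna (string &key methylated-positions)
  "Return a vector of SKETCH-NUCLEOTIDE objects built from STRING.
Zero-based positions listed in METHYLATED-POSITIONS are tagged with
the keyword :METHYLATED."
  (let ((dna (make-array (length string))))
    (dotimes (i (length string) dna)
      (setf (aref dna i)
            (make-instance 'sketch-nucleotide
                           :base (char string i)
                           :modifications
                           (when (member i methylated-positions)
                             (list :methylated)))))))

(defun sequence-string (dna)
  "Return the plain string view of DNA, dropping all modifications."
  (map 'string #'nucleotide-base dna))

(defun modified-positions (dna)
  "Return the zero-based positions of all nucleotides carrying a modification."
  (loop for nucleotide across dna
        for position from 0
        when (nucleotide-modifications nucleotide)
          collect position))
```

With `(make-sketch-dna "ACGT" :methylated-positions '(1))', the string
view is still "ACGT", while `modified-positions' reports the methylated
cytosine at position 1.  No separate feature table is needed to keep the
two in sync.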


RNA->Protein Translation and Genetic code
-----------------------------------------

The translation of mRNA to proteins follows the so-called genetic
code.  While most eukaryotic genomes are translated using the
Standard Genetic Code, there are organisms that use a slightly
different code.  To allow for different codes, there is a global
(special) variable *default-genetic-code* that can be set
accordingly. It has to hold one of the codon tables that are defined
by default.  The list of genetic codes is incomplete but will be
updated. The file data/genetic-code.prt lists genetic codes from the
genetic code database.
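
The role of a special variable holding the codon table can be sketched as
follows.  This is not CLCB's implementation: the variable
`*sketch-genetic-code*', the two tiny codon tables and the function
`translate-sketch' are invented for this example, and the tables contain
only four codons each.  The one difference shown, UGA being a stop codon
in the standard code but tryptophan in the vertebrate mitochondrial code,
is real.

```lisp
;; Hypothetical sketch: names and table layout are assumptions made for
;; illustration; CLCB's *default-genetic-code* may be organized differently.
(defun make-codon-table (alist)
  "Return a hash table mapping codon strings to one-letter amino acid codes."
  (let ((table (make-hash-table :test #'equal)))
    (loop for (codon . amino-acid) in alist
          do (setf (gethash codon table) amino-acid))
    table))

;; Deliberately tiny excerpts; #\* marks a stop codon.
(defparameter *sketch-standard-code*
  (make-codon-table
   '(("AUG" . #\M) ("UUU" . #\F) ("UGG" . #\W) ("UGA" . #\*))))

(defparameter *sketch-vertebrate-mitochondrial-code*
  (make-codon-table
   '(("AUG" . #\M) ("UUU" . #\F) ("UGG" . #\W) ("UGA" . #\W))))

;; The special variable that selects the code used for translation.
(defvar *sketch-genetic-code* *sketch-standard-code*)

(defun translate-sketch (mrna)
  "Translate the string MRNA codon by codon using *SKETCH-GENETIC-CODE*.
Codons missing from the table become #\\X; a trailing incomplete codon
is ignored."
  (coerce (loop for start from 0 to (- (length mrna) 3) by 3
                collect (gethash (subseq mrna start (+ start 3))
                                 *sketch-genetic-code*
                                 #\X))
          'string))

;; With the default binding, UGA is read as a stop codon.
(translate-sketch "AUGUUUUGA")          ; => "MF*"

;; Rebinding the special variable changes the code for the dynamic
;; extent of the LET only.
(let ((*sketch-genetic-code* *sketch-vertebrate-mitochondrial-code*))
  (translate-sketch "AUGUUUUGA"))       ; => "MFW"
```

Because the variable is special, rebinding it with `let' affects every
translation called inside that form and is undone automatically
afterwards, so callers do not need to pass a codon table explicitly.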


EnsEMBL
-------

The EnsEMBL genome database is one of the most comprehensive and
valuable sources of biological data available today.  The directory
`ensembl' contains a package built on top of the CLCB package to enable
access to this resource.  It depends on CLSQL, a SQL database interface
that supports a number of RDBMS, including MySQL, which is used by
EnsEMBL.

For now, there are only two files with poorly chosen names:

* ensembl/ensembl-classes.lisp: Contains the class definitions.  All
  classes inherit from standard-db-object, which comes from using
  `def-view-class' for defining them.  Those definitions are wrapped
  in additional macros to provide a higher level of abstraction.
  `def-ensembl-view' behaves similarly to `def-view-class', but accepts
  additional keywords to specify whether an object has a stable-id or
  provides some sequence information.  The macro badly needs
  documentation (and maybe a partial rewrite).

* ensembl/ensembl-methods.lisp: Contains methods specialized on ensembl
  objects.  EnsEMBL objects should be accessed exclusively through
  those methods.  We also included some easy ways to retrieve objects
  from the database using their stable id.  [There is a _long_ TODO list
  for this file.]
;; (Since we don't have any other objects similar to those from ensembl
;;  just now, the generic method definitions are included in this file.
;;  They will have to go to another location directly in the CLCB
;;  package as soon as we add gene and protein classes that do not
;;  relate to EnsEMBL.)
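
The idea of wrapping class definitions in a macro that accepts extra
keywords, as described for `def-ensembl-view' above, can be sketched in
plain CLOS.  The sketch deliberately does not use CLSQL, so the classes
are ordinary classes and not database views.  The macro name
`def-sketch-view', its `:stable-id' keyword and both example classes are
assumptions made for illustration; the real macro may take different
arguments.

```lisp
;; Hypothetical sketch in plain CLOS.  It does not use CLSQL, and the
;; real def-ensembl-view may accept different arguments and keywords.
(defmacro def-sketch-view (name superclasses slots &key stable-id)
  "Define class NAME like DEFCLASS; add a STABLE-ID slot if STABLE-ID is true."
  `(defclass ,name ,superclasses
     (,@slots
      ,@(when stable-id
          '((stable-id :initarg :stable-id :reader stable-id))))))

;; A made-up class with one ordinary slot plus the optional stable-id slot.
(def-sketch-view sketch-gene ()
  ((biotype :initarg :biotype :reader biotype))
  :stable-id t)

;; A made-up class that does not request a stable id.
(def-sketch-view sketch-note ()
  ((text :initarg :text :reader note-text)))
```

An instance created with
`(make-instance 'sketch-gene :stable-id "EXAMPLE-ID-1" :biotype "protein_coding")'
answers to the `stable-id' reader, whereas `sketch-note' has no such
slot.  Keeping the keyword handling in one macro means the slot
definition for stable ids is written only once.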

Albert Krewinkel