2. CONVENTIONS USED IN THE DATA BANK
The following sections describe the general conventions used in
SWISS-PROT to achieve uniformity of presentation. Experienced users of the
EMBL Database can skip these sections and directly refer to Appendix C,
which lists  the minor  differences in  format  between  the  two  data
collections.
2.1 General structure of the data bank
The SWISS-PROT  protein sequence  data bank  is  composed  of  sequence
entries. Each  entry corresponds  to a  single contiguous  sequence  as
contributed to  the bank  or reported in the literature. In some cases,
entries have been assembled from several papers that report overlapping
sequence regions.  Conversely, a  single paper  can  provide  data  for
several entries,  e.g. when  related sequences from different organisms
are reported.
References to  positions within  a sequence  are made  using sequential
numbering, beginning with 1 at the N-terminal end of the sequence.
Except for  initiator N-terminal  methionine residues,  which  are  not
included in  a sequence when their absence from the mature sequence has
been proven,  the sequence  data correspond  to the precursor form of a
protein before post-translational modifications and processing.
2.2 Classes of data
To make  data available  to users  as quickly  as possible  after
publication,  SWISS-PROT  entries  may  be  released  before
all their  details are finalized. The concept of data classes gives the
user some  idea of  the areas  in which  the data still require further
work. The  class of  each entry  is indicated on the first (ID) line of
the entry. At present two classes are supported:
STANDARD   :   Data which  are complete  to the  standards laid down by
               the SWISS-PROT data bank.
PRELIMINARY:   Data for  which  only  the  sequence  and  bibliographic
               information have been submitted to thorough checks.
2.3 Structure of a sequence entry
The entries  in the  SWISS-PROT data  bank are  structured so  as to be
usable  by   human  readers  as  well  as  by  computer  programs.  The
explanations, descriptions,  classifications and  other comments are in
ordinary English.  Wherever possible,  symbols familiar to biochemists,
protein chemists and molecular biologists are used.
Each sequence  entry is  composed of  lines. Different  types of lines,
each with  their own  format, are used to record the various data which
make up  the entry.  A sample sequence entry is shown below.
ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.
AC   P01375;
DT   21-JUL-1986 (REL. 01, CREATED)
DT   21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT   01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE   TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
GN   TNFA.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 87217060.
RA   NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A.,
RA   AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N.,
RA   FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A.,
RA   CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.;
RL   COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986).
RN   [2]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85086244.
RA   PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,
RA   PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;
RL   NATURE 312:724-729(1984).
RN   [3]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85137898.
RA   SHIRAI T., YAMAGUCHI H., ITO H., TODD C.W., WALLACE R.B.;
RL   NATURE 313:803-806(1985).
RN   [4]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 86016093.
RA   NEDWIN G.E., NAYLOR S.L., SAKAGUCHI A.Y., SMITH D.H.,
RA   JARRETT-NEDWIN J., PENNICA D., GOEDDEL D.V., GRAY P.W.;
RL   NUCLEIC ACIDS RES. 13:6361-6373(1985).
RN   [5]
RP   SEQUENCE FROM N.A.
RX   MEDLINE; 85142190.
RA   WANG A.M., CREASEY A.A., LADNER M.B., LIN L.S., STRICKLER J.,
RA   VAN ARSDELL J.N., YAMAMOTO R., MARK D.F.;
RL   SCIENCE 228:149-154(1985).
RN   [6]
RP   X-RAY CRYSTALLOGRAPHY (2.6 ANGSTROMS).
RX   MEDLINE; 90008932.
RA   ECK M.J., SPRANG S.R.;
RL   J. BIOL. CHEM. 264:17595-17605(1989).
RN   [7]
RP   X-RAY CRYSTALLOGRAPHY (2.9 ANGSTROMS).
RX   MEDLINE; 91193276.
RA   JONES E.Y., STUART D.I., WALKER N.P.;
RL   J. CELL SCI. SUPPL. 13:11-18(1990).
RN   [8]
RP   X-RAY CRYSTALLOGRAPHY (2.6 ANGSTROMS).
RX   MEDLINE; 90008932.
RA   ECK M.J., SPRANG S.R.;
RL   J. BIOL. CHEM. 264:17595-17605(1989).
RN   [9]
RP   MUTAGENESIS.
RX   MEDLINE; 91184128.
RA   OSTADE X.V., TAVERNIER J., PRANGE T., FIERS W.;
RL   EMBO J. 10:827-836(1991).
RN   [10]
RP   MYRISTOYLATION.
RX   MEDLINE; 93018820.
RA   STEVENSON F.T., BURSTEN S.L., LOCKSLEY R.M., LOVETT D.H.;
RL   J. EXP. MED. 176:1053-1062(1992).
CC   -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN
CC       CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED
CC       IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING
CC       FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT
CC       CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION
CC       UNDER CERTAIN CONDITIONS.
CC   -!- SUBUNIT: HOMOTRIMER.
CC   -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS
CC       AN EXTRACELLULAR SOLUBLE FORM.
CC   -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC       PROTEOLYTIC PROCESSING.
CC   -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING
CC       CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL
CC       HEALTH AND MALNUTRITION.
CC   -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.
DR   EMBL; X02910; HSTNFA.
DR   EMBL; M16441; HSTNFAB.
DR   EMBL; X01394; HSTNFR.
DR   EMBL; M10988; HSTNFAA.
DR   PIR; B23784; QWHUN.
DR   PIR; A44189; A44189.
DR   PDB; 1TNF; 15-JAN-91.
DR   PDB; 2TUN; 31-JAN-94.
DR   MIM; 191160; 11TH EDITION.
DR   PROSITE; PS00251; TNF.
KW   CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;
KW   MYRISTYLATION; 3D-STRUCTURE.
FT   PROPEP        1     76
FT   CHAIN        77    233       TUMOR NECROSIS FACTOR.
FT   TRANSMEM     36     56       SIGNAL-ANCHOR (TYPE-II PROTEIN).
FT   LIPID        19     19       MYRISTATE.
FT   LIPID        20     20       MYRISTATE.
FT   DISULFID    145    177
FT   MUTAGEN     108    108       R->W: BIOLOGICALLY INACTIVE.
FT   MUTAGEN     112    112       L->F: BIOLOGICALLY INACTIVE.
FT   MUTAGEN     162    162       S->F: BIOLOGICALLY INACTIVE.
FT   MUTAGEN     167    167       V->A,D: BIOLOGICALLY INACTIVE.
FT   MUTAGEN     222    222       E->K: BIOLOGICALLY INACTIVE.
FT   CONFLICT     63     63       F -> S (IN REF. 5).
FT   STRAND       89     93
FT   TURN         99    100
FT   TURN        109    110
FT   STRAND      112    113
FT   TURN        115    116
FT   STRAND      118    119
FT   STRAND      124    125
FT   STRAND      130    143
FT   STRAND      152    159
FT   STRAND      166    170
FT   STRAND      173    174
FT   TURN        183    184
FT   STRAND      189    202
FT   TURN        204    205
FT   STRAND      207    212
FT   HELIX       215    217
FT   STRAND      218    218
FT   STRAND      227    232
SQ   SEQUENCE   233 AA;  25644 MW;  279986 CN;
     MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
     EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
     DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
     TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
//
Each line  begins with  a two-character  line code, which indicates the
type of  data contained  in the  line. The  current line types and line
codes, and the order in which they appear in an entry, are shown below:
ID     - Identification.
AC     - Accession number(s).
DT     - Date.
DE     - Description.
GN     - Gene name(s).
OS     - Organism species.
OG     - Organelle.
OC     - Organism classification.
RN     - Reference number.
RP     - Reference position.
RC     - Reference comments.
RX     - Reference cross-references.
RA     - Reference authors.
RL     - Reference location.
CC     - Comments or notes.
DR     - Database cross-references.
KW     - Keywords.
FT     - Feature table data.
SQ     - Sequence header.
       - (blanks) sequence data.
//     - Termination line.
Some entries  do not contain all of the line types, and some line types
occur many  times in  a single  entry. Each  entry must  begin with  an
identification line  (ID) and  end with  a  terminator  line  (//).  In
addition the  following line  types are  always present in an entry: AC
(once), DT  (3 times),  DE (1 or more), OS (1 or more), OC (1 or more),
RN (1  or more),  RP (1  or more),  RA (1  or more), RL (1 or more), SQ
(once), and  at least one sequence data line. The other line types (GN,
OG, RC, RX, CC, DR, KW and FT) are optional.
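As an illustration only, the occurrence rules above can be checked with a
short Python sketch. The names MIN_COUNTS, missing_line_types and
is_well_framed are hypothetical and not part of any SWISS-PROT software.
The sketch checks minimum counts only, and it assumes that each line of an
entry is available as a separate string.

```python
from collections import Counter

# Minimum number of occurrences per line code, taken from the text above.
# The empty code "" stands for sequence data lines, which begin with blanks.
MIN_COUNTS = {"ID": 1, "AC": 1, "DT": 3, "DE": 1, "OS": 1, "OC": 1,
              "RN": 1, "RP": 1, "RA": 1, "RL": 1, "SQ": 1, "": 1, "//": 1}


def missing_line_types(lines):
    """Return the line codes that occur less often than required."""
    counts = Counter(line[:2].strip() for line in lines if line.strip())
    return sorted(code for code, n in MIN_COUNTS.items() if counts[code] < n)


def is_well_framed(lines):
    """Check that an entry begins with an ID line and ends with '//'."""
    lines = [line for line in lines if line.strip()]
    return bool(lines) and lines[0][:2] == "ID" and lines[-1].strip() == "//"
```

An empty list returned by missing_line_types means that every mandatory
line type is present at least the required number of times.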
A detailed  description of  each line type is given in the next section
of this document.
It must  be noted  that all  SWISS-PROT line  types exist  in the  EMBL
Database. A  description of  the format  differences between the SWISS-
PROT and EMBL data banks is given in Appendix C of this document.
The two-character  line type  code which  begins each  line  is  always
followed by  three blanks,  so that  the actual information begins with
the sixth  character. Information  is  not  extended  beyond  character
position 75.
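The layout just described (line code in columns 1-2, blanks in columns
3-5, information from column 6) can be sketched in Python as follows. The
function name split_line is an assumption made for this example.

```python
def split_line(line):
    """Split a flat-file line into (line_code, information).

    Assumes a two-character code in columns 1-2, blanks in columns 3-5
    and information from column 6 onwards. Sequence data lines start
    with blanks, so their code comes back as an empty string.
    """
    line = line.rstrip("\n")
    code = line[:2].strip()
    information = line[5:].rstrip()
    return code, information
```

For the termination line the information part is empty, and for sequence
data lines the code part is empty.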
                   3. THE DIFFERENT LINE TYPES
3.1 The ID line
The ID  (IDentification) line is always the first line of an entry. The
general form of the ID line is:
ID   ENTRY_NAME   DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.
     3.1.1 Entry Name
The first  item on  the ID line is the entry name of the sequence. This
name is  a useful  means of  identifying a  sequence.  The  entry  name
consists of up to ten uppercase alphanumeric characters.
SWISS-PROT uses  a general  purpose  naming  convention  which  can  be
symbolized as X_Y, where
X  is a mnemonic code of at most 4 alphanumeric characters representing
   the protein name. Examples: B2MG is for Beta-2-microglobulin, HBA is
   for Hemoglobin alpha chain and INS is for Insulin.
The `_' sign serves as a separator.
Y  is a  mnemonic species identification code of at most 5 alphanumeric
   characters representing  the biological  source of the protein. This
   code is  generally made  of the first three letters of the genus and
   the first  two letters  of  the  species.  Examples:  PSEPU  is  for
   Pseudomonas putida and NAJNI is for Naja nivea.
   However, for  species commonly  encountered in  the data bank, self-
   explanatory codes  are used.  There are 16 of those codes. They are:
   BOVIN for  Bovine, CHICK  for Chicken,  ECOLI for  Escherichia coli,
   HORSE for Horse, HUMAN for Human, MAIZE for Maize (Zea mays), MOUSE
   for Mouse,  PEA for  Garden pea  (Pisum sativum), PIG for Pig, RABIT
   for Rabbit, RAT for Rat, SHEEP for Sheep, SOYBN for Soybean (Glycine
   max), TOBAC for Common tobacco (Nicotiana tabacum), WHEAT for Wheat
   (Triticum  aestivum),   YEAST  for   Baker's  yeast   (Saccharomyces
   cerevisiae).
   As it  was not  possible to  apply the  above rules to viruses, they
   were given arbitrary, but generally easy to remember, identification
   codes. In some cases it was not possible to assign a definitive code
   to a species. In these cases a temporary code was chosen.
Examples of  complete protein  sequence entry  names are: RL1_ECOLI for
ribosomal protein  L1 from  Escherichia coli,  FER_HALHA for ferredoxin
from Halobacterium halobium.
All the  presently defined  species identification  codes are listed in
the SWISS-PROT document file SPECLIST.TXT.
     3.1.2 Data class
The second  item on  the ID  line indicates the data class of the entry
(see section 2.2).
     3.1.3 Molecule type
The third  item on  the ID  line is a three letter code which indicates
the type  of molecule  of the  entry: in  SWISS-PROT  it  is  PRT  (for
PRoTein).
     3.1.4 Length of the molecule
The fourth  and last item of the ID line is the length of the molecule,
which is  the total  number of amino acids in the sequence. This number
includes the  positions reported  to be present but which have not been
determined (coded as `X'). The length is followed by the letter code AA
(Amino Acids).
     3.1.5 Examples of identification lines
Two examples of ID lines are shown below:
ID   CYC_BOVIN      STANDARD;      PRT;   104 AA.
ID   GIA2_GIALA     PRELIMINARY;   PRT;   296 AA.
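A possible way of reading an ID line in Python is sketched below. The
regular expression is an assumption derived from the description in
sections 3.1.1 to 3.1.4 (protein code of at most 4 characters, species
code of at most 5 characters, the two data classes of section 2.2, and the
PRT molecule type). The name parse_id_line is hypothetical.

```python
import re

# Pattern built from the general form of the ID line given above.
ID_RE = re.compile(
    r"^ID   (?P<name>[A-Z0-9]{1,4}_[A-Z0-9]{1,5})\s+"
    r"(?P<data_class>STANDARD|PRELIMINARY);\s+"
    r"(?P<molecule>PRT);\s+"
    r"(?P<length>\d+) AA\.$"
)


def parse_id_line(line):
    """Return the items of an ID line as a dictionary."""
    match = ID_RE.match(line.rstrip())
    if match is None:
        raise ValueError("not a valid ID line: %r" % line)
    protein, species = match.group("name").split("_")
    return {
        "entry_name": match.group("name"),
        "protein": protein,        # the X part of X_Y
        "species": species,        # the Y part of X_Y
        "data_class": match.group("data_class"),
        "molecule_type": match.group("molecule"),
        "length": int(match.group("length")),
    }
```

Applied to the first example above, the sketch reports the species code
BOVIN and a length of 104.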
3.2 The AC line
The AC  (ACcession number)  line lists the accession numbers associated
with an entry. An example of an accession number line is shown below:
AC   P00321; P05348;
The accession  numbers are  separated by  semicolons and  the  list  is
terminated by  a semicolon. If necessary, more than one AC line will be
used.  Most   SWISS-PROT  sequence  entries  currently  have  only  one
accession number.
The purpose  of accession  numbers  is  to  provide  a  stable  way  of
identifying entries  from release to release. It is sometimes necessary
for reasons  of consistency  to change  the names  of the  entries, for
example, to ensure that related entries have similar names. However, an
accession number  is always conserved, and therefore allows unambiguous
citation of SWISS-PROT entries.
Researchers who  wish to  cite entries  in  their  publications  should
always cite the first accession number.
Entries will  have more  than one  accession number  if they  have been
merged or  split. For  example, when two entries are merged into one, a
new accession  number goes  at the start of the AC line, and those from
the merged entries are listed after this one. Similarly, if an existing
entry is  split into  two or  more entries  (a  rare  occurrence),  the
original accession number list is retained in all the derived entries.
An accession  number is  dropped only  when the  data to  which it  was
assigned have been completely removed from the data bank.
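The following Python sketch shows one way to collect the accession
numbers of an entry and to pick the one to cite. The function names
parse_ac_lines and citable_accession are assumptions made for this
example.

```python
def parse_ac_lines(lines):
    """Collect accession numbers, in order, from the AC lines of an entry."""
    numbers = []
    for line in lines:
        if line.startswith("AC"):
            # Numbers are separated by semicolons; the list ends with one.
            numbers.extend(n.strip() for n in line[5:].split(";") if n.strip())
    return numbers


def citable_accession(lines):
    """Return the first accession number, which is the one to cite."""
    return parse_ac_lines(lines)[0]
```

For the AC line shown above, the accession number to cite would be
P00321.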
3.3 The DT line
The DT  (DaTe) lines show the date of entry or last modification of the
sequence entry. The format of the DT lines is:
DT   DD-MMM-YEAR (REL. XX, COMMENT)
where `DD'  is the  day, `MMM' the month, `YEAR' the year, and `XX' the
SWISS-PROT release  number. The  comment portion  of the line indicates
the action  taken on that date. There are ALWAYS three DT lines in each
entry, each associated with a specific comment:
-  The first  DT line  indicates when  the entry  first appeared in the
   data bank. The associated comment is `CREATED'.
-  The second  DT line  indicates  when  the  sequence  data  was  last
   modified. The associated comment is `LAST SEQUENCE UPDATE'.
-  The third  DT line  indicates when  any data other than the sequence
   was last  modified.  The  associated  comment  is  `LAST  ANNOTATION
   UPDATE'.
Example of a block of DT lines:
DT   01-JAN-1988 (REL. 06, CREATED)
DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)
DT   01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)
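A DT line could be decoded as sketched below in Python. The month table
assumes the usual three-letter English abbreviations, and the name
parse_dt_line is hypothetical.

```python
import re
from datetime import date

# Three-letter month abbreviations (assumed to be the English ones).
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

DT_RE = re.compile(
    r"^DT   (\d{2})-([A-Z]{3})-(\d{4}) \(REL\. (\d+), ([A-Z ]+)\)$")


def parse_dt_line(line):
    """Return (date, release_number, comment) for one DT line."""
    match = DT_RE.match(line.rstrip())
    if match is None or match.group(2) not in MONTHS:
        raise ValueError("not a valid DT line: %r" % line)
    day, month, year, release, comment = match.groups()
    return date(int(year), MONTHS[month], int(day)), int(release), comment
```

The comment returned is one of CREATED, LAST SEQUENCE UPDATE or LAST
ANNOTATION UPDATE when the line follows the rules given above.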
3.4 The DE line
The DE  (DEscription) lines  contain  general  descriptive  information
about the  sequence stored. This information is generally sufficient to
identify the sequence precisely. The format of the DE lines is:
DE   DESCRIPTION.
The description  is given  in ordinary  English and  is free-format. In
some cases,  more than  one DE line is required; in this case, the text
is divided  only between  words and only the last DE line is terminated
by a period.
When the  complete sequence  was not  determined the  last  information
given on the DE lines will be `(FRAGMENT)' or `(FRAGMENTS)'.
Two examples of description lines are given here:
DE   NADH DEHYDROGENASE (EC 1.6.99.3).
DE   LYSOPINE DEHYDROGENASE (EC 1.5.1.16) (OCTOPINE SYNTHASE)
DE   (LYSOPINE SYNTHASE) (FRAGMENT).
3.5 The GN line
The GN (Gene Name) line contains the name(s) of the gene(s) that code
for the stored protein sequence. The format of the GN line is:
GN   NAME1[ AND|OR NAME2...].
Examples:
GN   ALB.
GN   REX-1.
It often  occurs that  more than  one gene name has been assigned to an
individual locus.  In that  case all  the synonyms  will be listed. The
word `OR'  separates the  different designations. The first name in the
list is  assumed to  be the most correct (or most current) designation.
Example:
GN   HNS OR DRDX OR OSMZ OR BGLY.
In a  few  cases,  multiple  genes  code  for  an  identical  protein
sequence. In that case all the different gene names will be listed. The
word `AND' separates the designations. Example:
GN   CECA1 AND CECA2.
In very  rare cases  `AND' and  `OR' can  both be present. In that case
parentheses are used as shown in the following example:
GN   GVPA AND (GVPB OR GVPA2).
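The `AND' and `OR' rules can be illustrated by the Python sketch below,
which turns a GN line into a list of genes, each gene being a list of
synonyms. It assumes that parentheses only ever enclose a group of `OR'
synonyms, as in the example above. The name parse_gn_line is
hypothetical.

```python
def parse_gn_line(line):
    """Return a list of genes; each gene is a list of synonymous names.

    `AND' separates different genes, `OR' separates synonyms of one gene.
    Assumes parentheses only wrap a group of `OR' synonyms.
    """
    text = line[5:].strip().rstrip(".")
    genes = []
    for group in text.split(" AND "):
        group = group.strip().strip("()")
        genes.append([name.strip() for name in group.split(" OR ")])
    return genes
```

The first name of each inner list is the designation assumed to be the
most correct or most current one.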
3.6 The KW line
The KW  (KeyWord) lines  provide  information  which  can  be  used  to
generate cross-reference  indexes of  the  sequence  entries  based  on
functional, structural,  or other  categories. The  keywords chosen for
each entry serve as a subject reference for the sequence. Often several
KW lines  are necessary  for a single entry. The format of the KW lines
is:
KW   KEYWORD[; KEYWORD...].
More than  one keyword  may be listed on each KW line; the keywords are
separated by  semicolons, and the last keyword is followed by a period.
Keywords may  consist of  more than one word (they may contain blanks),
but are never split between lines. An example of a KW line is:
KW   EYE LENS PROTEIN; ACETYLATION.
The order  of the  keywords is not significant. The above example could
also have been written:
KW   ACETYLATION; EYE LENS PROTEIN.
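As a minimal sketch, the keywords of an entry could be gathered in Python
as shown below. The function name parse_kw_lines is an assumption. Since
keywords are never split between lines, the KW lines are simply joined
before being divided at the semicolons.

```python
def parse_kw_lines(lines):
    """Return the keywords found on the KW lines of an entry."""
    text = " ".join(line[5:].strip() for line in lines
                    if line.startswith("KW"))
    text = text.rstrip(".")          # the last keyword ends with a period
    return [keyword.strip() for keyword in text.split(";") if keyword.strip()]
```

Because the order of the keywords is not significant, comparing the
results as sets is usually more appropriate than comparing them as lists.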
3.7 The OS line
The OS  (Organism Species) line specifies the source organism(s) of the
stored  sequence.  In the  rare case  where all  the species
information will  not fit  on a  single line  more than  one OS line is
used. The last OS line is terminated by a period.
The species designation consists, in most cases, of the Latin genus and
species designation  followed by the English name (in parentheses). For
viruses, only  the common  English name  is given.  In  cases  where  a
protein sequence  is identical  in more than one species the OS line(s)
will list the names of all those species.
Examples of OS lines are shown here:
OS   ESCHERICHIA COLI.
OS   HOMO SAPIENS (HUMAN).
OS   ROUS SARCOMA VIRUS (STRAIN SCHMIDT-RUPPIN).
OS   NAJA NAJA (INDIAN COBRA), AND NAJA NIVEA (CAPE COBRA).
3.8 The OG line
The OG  (OrGanelle) lines  indicate if  the gene  coding for  a protein
originates from  the mitochondria,  the chloroplast,  a cyanelle,  or a
plasmid. The format of the OG line is:
OG   CHLOROPLAST.
OG   CYANELLE.
OG   MITOCHONDRION.
OG   PLASMID name.
Where `name' is the name of the plasmid.
3.9 The OC line
The  OC   (Organism  Classification)   lines  contain   the   taxonomic
classification of  the source  organism. The  classification is  listed
top-down as  nodes in  a taxonomic  tree  in  which  the  most  general
grouping is  given first.  The classification  may be  distributed over
several OC  lines, but nodes are not split or hyphenated between lines.
The individual  items are  separated by  semicolons  and  the  list  is
terminated by a period. The format of the OC lines is:
OC   NODE[; NODE...].
For example the classification lines for a human sequence would be:
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
3.10 The reference (RN, RP, RC, RX, RA, RL) lines
These lines  comprise the  literature citations  within SWISS-PROT. The
citations indicate  the papers from which the data has been abstracted.
The reference  lines for  a given  citation occur  in a  block, and are
always in  the order RN, RP, RC, RX, RA, RL. Within each such reference
block the  RN and  RP lines occur once, the RC line occurs zero or more
times, the  RX line  occurs at most once, and the RA and RL lines each
occur one or more times. If several references are given, there will be
a reference block for each.
An example of a complete reference is:
RN   [1]
RP   SEQUENCE FROM N.A., AND SEQUENCE OF 1-15.
RC   STRAIN=SPRAGUE-DAWLEY; TISSUE=LIVER;
RX   MEDLINE; 91002678.
RA   CHAN Y.-L., PAZ V., OLVERA J., WOOL I.G.;
RL   BIOCHIM. BIOPHYS. ACTA 1050:69-73(1990).
The formats of the individual lines are explained below.
     3.10.1 The RN line
The RN  (Reference Number)  line gives  a  sequential  number  to  each
reference citation  in an  entry. This  number is  used to indicate the
reference in  comments and  feature table  notes. The  format of the RN
line is:
RN   [N]
where N  denotes the nth reference for this entry. The reference number
is always enclosed in square brackets.
     3.10.2 The RP line
The RP  (Reference Position)  line describes  the extent  of  the  work
carried out by the authors of the reference cited. The format of the RP
line is:
RP   COMMENT.
Typical examples of RP lines are shown below:
RP   SEQUENCE FROM N.A.
RP   SEQUENCE FROM N.A., AND SEQUENCE OF 12-35.
RP   SEQUENCE OF 34-56; 67-73 AND 123-345, AND DISULFIDE BONDS.
RP   REVISIONS TO 67-89.
RP   STRUCTURE BY NMR.
RP   X-RAY CRYSTALLOGRAPHY (1.8 ANGSTROMS).
RP   CHARACTERIZATION.
RP   MUTAGENESIS OF TYR-56.
RP   REVIEW.
RP   VARIANT ALA-58.
RP   VARIANTS XLI LEU-341; ARG-372 AND TYR-446.
     3.10.3 The RC line
The RC  (Reference Comment)  lines are optional lines which are used to
store comments  relevant to  the reference  cited. The format of the RC
line is:
RC   TOKEN1=TEXT; TOKEN2=TEXT; .....
Where the currently defined tokens are:
     PLASMID
     SPECIES
     STRAIN
     TISSUE
     TRANSPOSON
The `SPECIES'  token is  only used  when an  entry describes a sequence
which is identical in more than one species; similarly the `PLASMID'
token is  only used  if an entry describes a sequence identical in more
than one plasmid.
An example of an RC line is:
RC   STRAIN=SPRAGUE-DAWLEY; TISSUE=LIVER;
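The TOKEN=TEXT pairs of an RC line could be read as in the following
Python sketch. The names KNOWN_TOKENS and parse_rc_line are assumptions
made for this example. The set of tokens is the list given above.

```python
# Tokens currently defined for RC lines, as listed above.
KNOWN_TOKENS = {"PLASMID", "SPECIES", "STRAIN", "TISSUE", "TRANSPOSON"}


def parse_rc_line(line):
    """Return the TOKEN=TEXT pairs of an RC line as a dictionary."""
    tokens = {}
    for item in line[5:].split(";"):
        if "=" in item:
            token, text = item.split("=", 1)
            tokens[token.strip()] = text.strip()
    return tokens
```

A caller may compare the keys of the result with KNOWN_TOKENS to detect
tokens that are not defined in this document.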
     3.10.4 The RX line
The RX  (Reference cross-reference)  line is  an optional line which is
used to  indicate the  identifier assigned to a specific reference in a
bibliographic database. The format of the RX line is:
RX   BIBLIOGRAPHIC_DATABASE_NAME; IDENTIFIER.
where the  valid bibliographic  database  names  and  their  associated
identifier are:
Name:       MEDLINE
Database:   Medline from the National Library of Medicine (NLM)
Identifier: Eight digit Medline Unique Identifier (UID)
Example of RX line:
RX   MEDLINE; 91002678.
     3.10.5 The RA line
The RA (Reference Author) lines list the authors of the paper (or other
work) cited.  All of  the authors  are included,  and are listed in the
order given  in the  paper. The names are listed surname first followed
by a  blank followed by initial(s) with periods. The authors' names are
separated by commas and terminated by a semicolon. Author names are not
split between lines. An example of the use of RA lines is shown below:
RA   YANOFSKY C., PLATT T., CRAWFORD I.P., NICHOLS B.P., CHRISTIE G.E.,
RA   HOROWITZ H., VAN CLEEMPUT M., WU A.M.;
As many RA lines as necessary are included for each reference.
     3.10.6 The RL line
The RL  (Reference Location)  lines contain  the conventional  citation
information for  the reference.  In general,  the RL  lines  alone  are
sufficient to find the paper in question.
a) Journal citations
The RL  line for  a journal citation includes the journal abbreviation,
the volume number, the page range, and the year. The format for such an
RL line is:
RL   JOURNAL VOL:PP-PP(YEAR).
Journal names  are abbreviated according to the conventions used by the
National Library  of Medicine  (NLM) and  are based on the existing ISO
and ANSI  standards. A  list of  the abbreviations  currently in use is
given in the SWISS-PROT document file JOURLIST.TXT.
An example of an RL line is:
RL   J. MOL. BIOL. 168:321-331(1983).
When a  reference is  made to  a paper  which is `in press' at the time
when the  data bank  is released,  the page  range,  and  possibly  the
volume number, are indicated as `0' (zero). An example of an RL line of
this type is shown here:
RL   NUCLEIC ACIDS RES. 22:0-0(1994).
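The journal citation format can be illustrated with the Python sketch
below. It assumes that the volume and page numbers are purely numeric and
that the citation fits on a single RL line. The name parse_journal_rl is
hypothetical. Lines in one of the other RL formats are not recognized and
yield None.

```python
import re

# JOURNAL VOL:PP-PP(YEAR). with numeric volume and pages (an assumption).
JOURNAL_RE = re.compile(
    r"^RL   (?P<journal>.+) (?P<volume>\d+):(?P<first>\d+)-(?P<last>\d+)"
    r"\((?P<year>\d{4})\)\.$")


def parse_journal_rl(line):
    """Return the parts of a journal citation, or None for other formats."""
    match = JOURNAL_RE.match(line.rstrip())
    if match is None:
        return None
    first, last = int(match.group("first")), int(match.group("last"))
    return {
        "journal": match.group("journal"),
        "volume": int(match.group("volume")),
        "first_page": first,
        "last_page": last,
        "year": int(match.group("year")),
        # A page range of 0-0 marks a paper that was `in press'.
        "in_press": first == 0 and last == 0,
    }
```

In this sketch only the page range is used to detect a paper that is `in
press', since the volume number may or may not be zero in that case.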
b) Book citations
A variation  of the RL line format is used for papers found in books or
other similar publications, which are cited as shown below:
RL   (IN) THE ENZYMES, 3RD ED., VOL.11, PART A, BOYER P.D., ED.,
RL   PP.397-547, ACADEMIC PRESS, NEW YORK, (1975).
The first RL line contains the designation `(IN)', which indicates that
this is  a  book  reference.  These  citations  generally  include  the
following  information:  the  title  of  the  book,  the  name  of  the
editor(s), the  page range,  the publisher  name, the  city where it is
published, and  the year  of publication (which is always shown between
parentheses).
c) Unpublished results
RL lines  for unpublished  results follow  the  format  shown  in  the
following example:
RL   UNPUBLISHED RESULTS, CITED BY:
RL   ULRICH E.L., KROGMANN D.W., MARKLEY J.L.;
RL   J. BIOL. CHEM. 257:9356-9364(1982).
d) Unpublished observations
For unpublished observations the format of the RL line is:
RL   UNPUBLISHED OBSERVATIONS (MMM-YEAR).
Where `MMM' is the month and `YEAR' is the year.
We use the `unpublished observations' RL line to cite communications by
scientists to  SWISS-PROT of unpublished information concerning various
aspects of a sequence entry.
e) Thesis
For Ph.D. theses the format of the RL line is:
RL   THESIS (YEAR), INSTITUTION_NAME, COUNTRY.
   
An example of such a line is given here:
RL   THESIS (1972), GEORGE WASHINGTON UNIVERSITY, U.S.A.
f) Patent applications
For patent applications the format of the RL line is:
RL   PATENT NUMBER PAT_NUMB, DD-MMM-YEAR.
Where `PAT_NUMB' is the international publication number of the patent,
`DD' is the day, `MMM' is the month and `YEAR' is the year.
g) Submissions
The final  form that  an RL line can take is that used for submissions.
The format of such an RL line is:
RL   SUBMITTED (MMM-YEAR) TO DATABASE_NAME.
Where `MMM' is the month, `YEAR' is the year and `DATABASE_NAME' is one
of the following:
     EMBL/GENBANK/DDBJ DATA BANKS
     THE SWISS-PROT DATA BANK
     THE ECOSEQ DATA BANK
     THE HIV DATA BANK
     THE MIM DATA BANK
     THE NEWAT DATA BANK
     THE PDB DATA BANK
     THE PIR DATA BANK
Two examples of submission RL lines are given here:
RL   SUBMITTED (APR-1994) TO EMBL/GENBANK/DDBJ DATA BANKS.
RL   SUBMITTED (FEB-1995) TO THE SWISS-PROT DATA BANK.
3.11 The DR line
     3.11.1 Definition
The DR  (Database  cross-Reference)  lines  are  used  as  pointers  to
information related to SWISS-PROT entries and found in data collections
other than SWISS-PROT.
For example,  if the  X-ray crystallographic  atomic coordinates  of  a
sequence are  stored in  the Brookhaven  Protein Data  Bank (PDB) there
will be DR line(s) pointing to the corresponding entry(ies) in that data
bank. For a sequence translated from a nucleotide sequence there can be
DR lines  pointing to  entries in  the EMBL or Genbank data banks which
correspond to  the DNA or RNA sequence(s) from which it was translated.
   
The format of the DR line is:
DR   DATA_BANK_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.
     3.11.2 Data bank identifier
The first  item on  the DR  line, the  data  bank  identifier,  is  the
abbreviated name of the data collection to which reference is made. The
currently defined data bank identifiers are the following:
EMBL                Nucleotide sequence database of EMBL (EBI)
DICTYDB             Dictyostelium discoideum genome database
ECO2DBASE           Escherichia coli gene-protein database (2D gel
                        spots) (ECO2DBASE)
ECOGENE             Escherichia coli K12 genome database (EcoGene)
FLYBASE             Drosophila genome database (FlyBase)
GCRDB               G-protein-coupled receptor database (GCRDb)
HIV                 HIV sequence database
HSSP                Homology-derived secondary structure of proteins
                        database (HSSP)
LISTA               Yeast (Saccharomyces cerevisiae) genome
                        database (LISTA)
MAIZEDB             Maize genome database (MaizeDB)
MIM                 Mendelian Inheritance in Man Database (MIM)
PDB                 Brookhaven Protein Data Bank (PDB)
PIR                 Protein sequence database of the Protein
                        Information Resource (PIR)
PROSITE             PROSITE dictionary of sites and patterns in
                        proteins
REBASE              Restriction enzyme database (REBASE)
AARHUS/GHENT-2DPAGE Human keratinocyte 2D gel protein database from
                        Aarhus and Ghent universities
SGD                 Saccharomyces Genome Database (SGD)
STYGENE             Salmonella typhimurium LT2 genome database
                        (StyGene)
SUBTILIST           Bacillus subtilis 168 genome database (SubtiList)
SWISS-2DPAGE        Human 2D  Gel Protein  Database from the University
                        of Geneva (SWISS-2DPAGE)
TRANSFAC            Transcription factor database (Transfac)
WORMPEP             Caenorhabditis elegans genome sequencing project
                        protein database (Wormpep)
YEPD                Yeast electrophoresis protein database (YEPD)
     3.11.3 The primary identifier
The second  item  on  the  DR  line,  the  primary  identifier,  is  an
unambiguous pointer  to the information entry in the data bank to which
reference is being made.
-  For an  EMBL, DictyDb,  EcoGene, FlyBase,  GCRDb, HIV,  LISTA,  PIR,
   PROSITE, SGD, StyGene, SubtiList, SWISS-2DPAGE or Transfac reference
   the primary identifier is the  first accession  number (also  called
   the Unique Identifier in some data  banks) of  the  entry  to  which
   reference is being made.
-  For a  MIM reference the primary identifier is the catalog number of
   the disease (or phenotype).
-  For a  PDB or  REBASE reference  the primary identifier is the entry
   name.
-  For an  AARHUS/GHENT-2DPAGE, ECO2DBASE or YEPD reference the primary
   identifier is the protein spot alphanumeric designation.
-  For a WormPep reference the primary identifier is the cosmid-derived
   name given  to that  protein  by  the  C.elegans  genome  sequencing
   project.
-  For a MaizeDB reference the primary identifier is the "Gene-product"
   accession ID.
-  For a  HSSP reference the primary identifier is the accession number
   of  a  SWISS-PROT  entry  cross-referenced  to  a  PDB  entry  whose
   structure is  expected to  be similar  to that of the entry in which
   the HSSP cross-reference is present.
     3.11.4 The secondary identifier
The third  and last  item on  the DR line, the secondary identifier, is
used to complement the information given by the primary identifier.
-  For an  EMBL, GenBank,  HIV, PIR  or PROSITE reference the secondary
   identifier is the entry's name.
-  For a PDB reference the secondary identifier is the most recent date
   on which PDB revised the entry (last `REVDAT' record).
-  For  a  DictyDb,  EcoGene, FlyBase, LISTA, SGD, StyGene or SubtiList
   reference the  secondary identifier  is the gene designation. If the
   gene designation is not available a dash "-" is used.
-  For a  MIM, REBASE,  or ECO2DBASE reference the secondary identifier
   is the  latest release  number or  edition of  the database that has
   been used to derive the cross-reference.
-  For a SWISS-2DPAGE reference the secondary identifier is the species
   of origin.
-  For an  AARHUS/GHENT-2DPAGE reference  the secondary  identifier  is
   either `IEF'  (for  isoelectric  focusing)  or  `NEPHGE'  (for  non-
   equilibrium pH gradient electrophoresis).
-  For a  WormPep  reference  the  secondary  identifier  is  a  number
   attributed by  the  C.elegans  genome  sequencing  project  to  that
   protein.
-  For a  GCRDb, MaizeDB,  Transfac or  YEPD  reference  the  secondary
   identifier is not defined and a dash "-" is stored in that field.
-  For a HSSP reference the secondary identifier is  the  entry name of
   the PDB structure  related  to  that  of the entry in which the HSSP
   cross-reference is present.
Examples of complete DR lines are shown here:
DR   AARHUS/GHENT-2DPAGE; 8006; IEF.
DR   DICTYDB; DD01047; MYOA.
DR   EMBL; X01704; GMNOD23.
DR   ECO2DBASE; G052.0; 6TH EDITION.
DR   ECOGENE; EG10054; ARAC.
DR   FLYBASE; FBGN0000055; ADH.
DR   GCRDB; GCR_0087; -.
DR   HIV; K02013; NEF$BRU.
DR   HSSP; P00438; 1DOB.
DR   LISTA; SC00018; ACT1.
DR   MAIZEDB; 25342; -.
DR   MIM; 249900; 11TH EDITION.
DR   PDB; 3ADK; 16-APR-88.
DR   PIR; A02768; R5EC7.
DR   PROSITE; PS00021; KRINGLE.
DR   REBASE; BSURI; RELEASE 9410.
DR   SGD; L0000008; AAR2.
DR   STYGENE; SG10312; PROV.
DR   SUBTILIST; BG10774; OPPD.
DR   SWISS-2DPAGE; P10599; HUMAN.
DR   TRANSFAC; T00141; -.
DR   WORMPEP; ZK637.7; CE00437.
DR   YEPD; 4270; -.
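The three items of a DR line can be separated mechanically. The
following is a minimal sketch in Python, assuming that the items are
delimited by semicolons and that the line ends with a period, as in the
examples above. The function name parse_dr_line is hypothetical and is
not part of any SWISS-PROT software.

```python
def parse_dr_line(line):
    """Split a DR line into (database, primary_id, secondary_id)."""
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    # Drop the two-letter line code, surrounding blanks and the final period.
    body = line[2:].strip()
    if body.endswith("."):
        body = body[:-1]
    # The three items are separated by semicolons.
    parts = [item.strip() for item in body.split(";")]
    if len(parts) != 3:
        raise ValueError("expected three items on a DR line: %r" % line)
    database, primary_id, secondary_id = parts
    # A dash means the secondary identifier is undefined or unavailable.
    if secondary_id == "-":
        secondary_id = None
    return database, primary_id, secondary_id


print(parse_dr_line("DR   EMBL; X01704; GMNOD23."))
print(parse_dr_line("DR   GCRDB; GCR_0087; -."))
```

Returning None for a dash is a design choice of this sketch; a program
that must reproduce the line exactly may prefer to keep the dash.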
3.12 The FT line
The FT (Feature Table) lines provide a precise but simple means for the
annotation of  the sequence  data. The table describes regions or sites
of interest  in the  sequence. In general the feature table lists post-
translational modifications,  binding sites, enzyme active sites, local
secondary structure  or other  characteristics reported  in  the  cited
references. Sequence  conflicts between references are also included in
the feature table. The feature table is updated when more becomes known
about a given sequence.
The FT  lines have a fixed format. The column numbers allocated to each
of the  data items within each FT line are shown in the following table
(column numbers  not referred  to in  the table  are always occupied by
blanks):
   +---------------+-----------------------+
   |     Columns   |   Data item           |
   +---------------+-----------------------+
   |       1- 2    |   FT                  |
   |       6-13    |   Key name            |
   |      15-20    |   `FROM' endpoint     |
   |      22-27    |   `TO' endpoint       |
   |      35-75    |   Description         |
   +---------------+-----------------------+
The key  name and  the endpoints  are always  on a single line, but the
description may  require continuation.  For this purpose, the next line
contains blanks in the key, `FROM', and `TO' column positions,
and the  description is  continued in its normal position. Thus a blank
key always denotes a continuation of the previous description.
An example of a feature table is shown below:
FT   NON_TER       1      1
FT   PEPTIDE       1      9       ARG-VASOPRESSIN.
FT   PEPTIDE      13    107       NEUROPHYSIN 2.
FT   PEPTIDE     109    147       COPEPTIN.
FT   DISULFID      1      6
FT   MOD_RES       9      9       AMIDATION (ACTIVE ARG-VASOPRESSIN).
FT   CONFLICT    102    102       D -> S (IN REF. 2).
FT   CONFLICT    105    105       MISSING (IN REF. 3).
FT   CARBOHYD    114    114
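Because the FT line has a fixed format, its items can be read by column
position. The sketch below, in Python, slices each line at the columns
given in the table above and appends continuation lines (blank key) to
the previous description. The function name parse_ft_lines and the
dictionary layout are hypothetical; the continuation line in the usage
example is constructed for illustration and is not taken from a real
entry.

```python
def parse_ft_lines(lines):
    """Return a list of feature dicts from FT lines, merging continuations."""
    features = []
    for line in lines:
        line = line.rstrip("\n")
        if line[0:2] != "FT":
            continue
        key = line[5:13].strip()      # columns 6-13, key name
        start = line[14:20].strip()   # columns 15-20, `FROM' endpoint
        end = line[21:27].strip()     # columns 22-27, `TO' endpoint
        text = line[34:75].strip()    # columns 35-75, description
        if key:
            features.append({"key": key, "from": start, "to": end,
                             "description": text})
        elif features:
            # A blank key denotes a continuation of the previous description.
            previous = features[-1]
            previous["description"] = (previous["description"] + " " + text).strip()
    return features


example = [
    "FT   PEPTIDE       1      9       ARG-VASOPRESSIN.",
    "FT   DISULFID      1      6",
    "FT   MOD_RES       9      9       AMIDATION (ACTIVE",
    "FT" + " " * 32 + "ARG-VASOPRESSIN).",  # hypothetical continuation line
]
for feature in parse_ft_lines(example):
    print(feature)
```

The endpoints are kept as strings here, since they may carry the
symbols described later in this section.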
The first  item on  each FT  line is  the key  name, which  is a  fixed
abbreviation (up to 8 characters) with a defined meaning. A list of the
currently defined  key names  can  be  found  in  Appendix  A  of  this
document.
Following the key name are the `FROM' and `TO' endpoint specifications.
These fields designate (inclusively) the endpoints of the feature named
in the  key field.  In general,  these fields  simply  contain  residue
numbers indicating positions in the sequence as listed. Note that these
positions are  always specified  assuming a  numbering  of  the  listed
sequence from  1 to  n; this  numbering is  not necessarily the same as
that used  in the  original reference(s). The following should be noted
in interpreting these endpoints:
-  If the  `FROM'  and  `TO'  specifications  are  equal,  the  feature
   indicated consists of the single amino acid at that position.
-  When a feature is known to extend beyond the end(s) of the sequenced
   region, the  endpoint  specification  will  be  preceded  by  <  for
   features which continue to the left end (N-terminal direction) or by
   >  for   features  which  continue  to  the  right  end  (C-terminal
   direction).
-  Unknown endpoints are denoted by `?'.
See also the notes concerning each of the key names in Appendix A.
The remaining  portion of  the FT  line is a description which contains
additional information  about the  feature. For  example, for a residue
post-translational modification  (key MOD_RES)  the chemical  nature of
that modification  is  given,  while  for  a  sequence  variation  (key
VARIANT) the  nature of the variation is indicated. This portion of the
line is  generally in  free form,  and may  be continued  on additional
lines when necessary.
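The endpoint conventions described in this section can be interpreted as
in the following Python sketch. It assumes that an endpoint field holds
either `?', a residue number, or a residue number preceded by < or >.
The function names parse_endpoint and feature_length are hypothetical.

```python
def parse_endpoint(field):
    """Interpret a `FROM' or `TO' field.

    Returns (position, beyond, unknown): position is an int or None,
    beyond is True when the feature extends past the sequenced region,
    unknown is True when the endpoint is given as `?'.
    """
    field = field.strip()
    if field == "?":
        return None, False, True
    beyond = field[:1] in ("<", ">")
    digits = field[1:] if beyond else field
    return int(digits), beyond, False


def feature_length(from_field, to_field):
    """Number of listed residues covered, or None if an endpoint is unknown."""
    start, _, start_unknown = parse_endpoint(from_field)
    end, _, end_unknown = parse_endpoint(to_field)
    if start_unknown or end_unknown:
        return None
    # Endpoints are inclusive; equal endpoints denote a single residue.
    return end - start + 1


print(parse_endpoint(">42"))
print(feature_length("37", ">42"))
```

For a feature that extends beyond the sequenced region, the length
returned covers only the residues present in the listed sequence.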
3.13 The SQ line
The SQ  (SeQuence header) line marks the beginning of the sequence data
and gives a quick summary of its content. The format of the SQ line is:
SQ   SEQUENCE  XXXX AA; XXXXX MW;  XXXXX CN;
The line contains the length of the sequence in amino acids (AA)
followed by the molecular weight (MW) rounded to the nearest gram and a
checking number (CN) as defined in the following reference:
Bairoch A.
Biochem. J. 203:527-528(1982).
An example of an SQ line is shown here:
SQ   SEQUENCE 104 AA; 11530 MW; 54319 CN;
The information  in the  SQ line  can be used as a check on accuracy or
for statistical  purposes. The  word `SEQUENCE'  is present  solely for
readability.
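The three numbers on the SQ line can be extracted with a regular
expression. The Python sketch below assumes the layout shown in the
format line above; the pattern SQ_PATTERN and the function parse_sq_line
are hypothetical names. It does not compute the checking number, whose
algorithm is defined in the cited reference.

```python
import re

# SQ   SEQUENCE  <length> AA; <weight> MW;  <checking number> CN;
SQ_PATTERN = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;\s+(\d+)\s+CN;\s*$"
)


def parse_sq_line(line):
    """Return the length, molecular weight and checking number of an SQ line."""
    match = SQ_PATTERN.match(line)
    if match is None:
        raise ValueError("not a valid SQ line: %r" % line)
    length, weight, checking_number = (int(group) for group in match.groups())
    return {
        "length": length,
        "molecular_weight": weight,
        "checking_number": checking_number,
    }


print(parse_sq_line("SQ   SEQUENCE 104 AA; 11530 MW; 54319 CN;"))
```

The length obtained in this way can be compared with the number of
residues actually read from the sequence data lines.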
3.14 The sequence data line
The sequence  data line has a line code consisting of two blanks rather
than the two-letter codes used up until now. The sequence is written 60
amino acids  per line,  in groups  of  10  amino  acids,  beginning  in
position 6 of the line.
The characters  used for  the amino  acids are  the standard  IUPAC one
letter codes (see Appendix B).
An example of sequence data lines is shown here:
     GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA ANKNKGIIWG
     EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ATNE
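The layout of the sequence data lines can be read and reproduced as in
the following Python sketch. It assumes only the rules given above: 60
residues per line, groups of 10 separated by one blank, beginning in
position 6. The function names read_sequence and format_sequence are
hypothetical.

```python
def read_sequence(lines):
    """Join sequence data lines into one string of one-letter codes."""
    return "".join("".join(line.split()) for line in lines)


def format_sequence(sequence):
    """Write a sequence 60 residues per line, in groups of 10, from column 6."""
    lines = []
    for i in range(0, len(sequence), 60):
        row = sequence[i:i + 60]
        groups = [row[j:j + 10] for j in range(0, len(row), 10)]
        # Five leading blanks: two for the blank line code, three for padding.
        lines.append(" " * 5 + " ".join(groups))
    return lines


data_lines = [
    "     GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA ANKNKGIIWG",
    "     EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ATNE",
]
sequence = read_sequence(data_lines)
print(len(sequence))
for line in format_sequence(sequence):
    print(line)
```

Reading the example lines and formatting the result again reproduces
the original lines, which is a convenient check on both functions.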
3.15 The CC line
The CC lines are free text comments on the entry, and may be used to
convey any useful information. The comments always appear below the
last reference line and are grouped together in comment blocks, a block
being made of 1 or more comment lines. The first line of a block is
marked with the characters `-!-'.
The format of a comment block is:
CC   -!- FIRST LINE OF A COMMENT BLOCK.
CC       SECOND AND SUBSEQUENT LINES OF A COMMENT BLOCK.
A major proportion of the comment blocks are arranged according to what
we designate as `topics'. The format of a comment block which belongs
to a `topic' is:
CC   -!- TOPIC: FREE TEXT DESCRIPTION.
The current topics and their definition are:
ALTERNATIVE PRODUCTS     Description  of   the  existence   of  related
                   protein sequence(s) produced by alternative splicing
                   of the  same gene(s)  or by  the use  of alternative
                   initiation codons.
CATALYTIC ACTIVITY Description  of  the  reaction(s)  catalysed  by  an
                   enzyme [*].
CAUTION            This topic  warns you  about possible  errors and/or
                   grounds for confusion.
COFACTOR           Description of an enzyme cofactor.
DEVELOPMENTAL STAGE Description of the developmentally specific
                   expression of a protein.
DISEASE            Description of  the  disease(s)  associated  with  a
                   deficiency of a protein.
DOMAIN             Description of the domain structure of a protein.
ENZYME REGULATION  Description of an enzyme regulatory mechanism.
FUNCTION           General description of the function(s) of a protein.
INDUCTION          Description of  the compound(s)  which stimulate the
                   synthesis of a protein.
PATHWAY            Description of the metabolic pathway(s) with which a
                   protein is associated.
POLYMORPHISM       Description of polymorphism(s).
PTM                Description of a post-translational modification.
SIMILARITY         Description of the similarity(ies) (sequence or
                   structural) of a protein with other proteins.
SUBCELLULAR LOCATION     Description of  the subcellular  location of a
                   mature protein product.
SUBUNIT            Description  of   the  quaternary   structure  of  a
                   protein.
TISSUE SPECIFICITY Description of the tissue specificity of a protein.
We show here, for each of the topics defined above, two examples of its
usage:
CC   -!- ALTERNATIVE PRODUCTS: SKELETAL MUSCLE AND FIBROBLAST
CC       TROPOMYOSINS ARE OBTAINED BY ALTERNATIVE MRNA SPLICING.
CC   -!- ALTERNATIVE PRODUCTS: USING ALTERNATIVE INITIATION CODONS IN
CC       THE SAME READING FRAME, THE GENE TRANSLATES INTO THREE
CC       ISOZYMES: ALPHA, BETA AND BETA'.
CC   -!- CATALYTIC ACTIVITY: ATP + L-GLUTAMATE + NH(3) = ADP +
CC       GLUTAMINE + ORTHOPHOSPHATE.
CC   -!- CATALYTIC ACTIVITY: (R)-2,3-DIHYDROXY-3-METHYLBUTANOATE +
CC       NADP(+) = (S)-2-HYDROXY-2-METHYL-3-OXOBUTANOATE + NADPH.
CC   -!- CAUTION: REF.2 SEQUENCE DIFFERS FROM THAT SHOWN IN POSITIONS
CC       92 TO 165 DUE TO A FRAMESHIFT.
CC   -!- CAUTION: IT IS UNCERTAIN WHETHER MET-1 OR MET-3 IS THE
CC       INITIATOR.
CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.
CC   -!- COFACTOR: FAD FLAVOPROTEIN AND NONHEME IRON.
CC   -!- DEVELOPMENTAL STAGE: EXPRESSED EARLY DURING CONIDIAL (DORMANT
CC       SPORES) DIFFERENTIATION.
CC   -!- DEVELOPMENTAL STAGE: EXPRESSED IN EMBRYONIC AND EARLY LARVAL
CC       STAGES.
CC   -!- DISEASE: DEFECTS IN PHKA1 ARE LINKED TO X-LINKED MUSCLE
CC       GLYCOGENOSIS, A DISEASE CHARACTERIZED BY SLOWLY PROGRESSIVE,
CC       PREDOMINANTLY DISTAL MUSCLE WEAKNESS AND ATROPHY.
CC   -!- DISEASE: DEFECTS IN ALD ARE THE CAUSE OF X-LINKED
CC       ADRENOLEUKODYSTROPHY, A PEROXISOMAL DISORDER CHARACTERIZED BY
CC       PROGRESSIVE DEMYELINATION OF THE CNS AND ADRENAL
CC       INSUFFICIENCY.
CC   -!- DOMAIN: CONTAINS A COILED-COIL DOMAIN ESSENTIAL FOR VESICULAR
CC       TRANSPORT AND A DISPENSABLE C-TERMINAL REGION.
CC   -!- DOMAIN: THE B CHAIN IS COMPOSED OF TWO DOMAINS, EACH DOMAIN
CC       CONSISTS OF 3 HOMOLOGOUS SUBDOMAINS (ALPHA, BETA, GAMMA).
CC   -!- ENZYME REGULATION: THE ACTIVITY OF THIS ENZYME IS CONTROLLED
CC       BY ADENYLATION. THE FULLY ADENYLATED ENZYME IS INACTIVE.
CC   -!- ENZYME REGULATION: ACTIVATED BY GRAM-NEGATIVE BACTERIAL
CC       LIPOPOLYSACCHARIDES AND CHYMOTRYPSIN.
CC   -!- FUNCTION: PROFILIN PREVENTS THE POLYMERIZATION OF ACTIN.
CC   -!- FUNCTION: INHIBITOR OF FUNGAL POLYGALACTURONASE. IT IS AN
CC       IMPORTANT FACTOR FOR PLANT RESISTANCE TO PHYTOPATHOGENIC
CC       FUNGI.
CC   -!- INDUCTION: BY SALT STRESS AND BY ABSCISIC ACID (ABA).
CC   -!- INDUCTION: BY INFECTION, PLANT WOUNDING, OR ELICITOR
CC       TREATMENT OF CELL CULTURES.
CC   -!- PATHWAY: FIRST STEP IN PROLINE BIOSYNTHESIS PATHWAY.
CC   -!- PATHWAY: LAST STEP IN PROTOHEME BIOSYNTHESIS. IN ERYTHROID
CC       CELLS, FERROCHELATASE APPEARS TO BE THE RATE-LIMITING ENZYME.
CC   -!- POLYMORPHISM: THE ALLELIC FORM OF THE ENZYME WITH GLN-191
CC       HYDROLYZES PARAOXON WITH A LOW TURNOVER NUMBER AND THE ONE
CC       WITH ARG-191 WITH A HIGH TURNOVER NUMBER.
CC   -!- POLYMORPHISM: THE TWO MAIN ALLELES OF HP ARE CALLED HP1F
CC       (FAST) AND HP1S (SLOW). THE SEQUENCE SHOWN HERE IS THAT OF THE
CC       HP1S FORM.
CC   -!- PTM: O-GLYCOSYLATED; AN UNUSUAL FEATURE AMONG VIRAL
CC       GLYCOPROTEINS.
CC   -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC       PROTEOLYTIC PROCESSING.
CC   -!- SIMILARITY: BELONGS TO THE SUBTILASE PROTEASES FAMILY. STRONG
CC       SIMILARITY WITH OTHER FURIN-LIKE ENZYMES.
CC   -!- SIMILARITY: BELONGS TO THE ATP-BINDING TRANSPORT PROTEIN
CC       FAMILY (ABC TRANSPORTERS). BELONGS TO THE MDR SUBFAMILY.
CC   -!- SUBCELLULAR LOCATION: MITOCHONDRIAL MATRIX.
CC   -!- SUBCELLULAR LOCATION: INTEGRAL MEMBRANE PROTEIN. INNER
CC       MEMBRANE.
CC   -!- SUBUNIT: HOMOTETRAMER.
CC   -!- SUBUNIT: HETERODIMER OF A LIGHT CHAIN AND A HEAVY CHAIN LINKED
CC       BY A DISULFIDE BOND.
CC   -!- TISSUE SPECIFICITY: KIDNEY, SUBMAXILLARY GLAND, AND URINE.
CC   -!- TISSUE SPECIFICITY: SHOOTS, ROOTS, AND COTYLEDON FROM
CC       DEHYDRATING SEEDLINGS.
[*]    Wherever possible, the catalytic activity of an enzyme is
       described following the recommendations of the Nomenclature
       Committee of the International Union of Biochemistry and
       Molecular Biology (IUBMB) as published in:
       Enzyme Nomenclature, NC-IUBMB, Academic Press, New York, (1992).
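The comment block structure described in this section can be recovered
from the CC lines as in the following Python sketch. It assumes that a
block starts with `-!-', that the following lines without that marker
continue the block, and that a topic is recognised only when it matches
one of the topics listed above. The names TOPICS and parse_cc_blocks are
hypothetical.

```python
TOPICS = {
    "ALTERNATIVE PRODUCTS", "CATALYTIC ACTIVITY", "CAUTION", "COFACTOR",
    "DEVELOPMENTAL STAGE", "DISEASE", "DOMAIN", "ENZYME REGULATION",
    "FUNCTION", "INDUCTION", "PATHWAY", "POLYMORPHISM", "PTM",
    "SIMILARITY", "SUBCELLULAR LOCATION", "SUBUNIT", "TISSUE SPECIFICITY",
}


def parse_cc_blocks(lines):
    """Group CC lines into (topic, text) pairs; topic is None for free text."""
    blocks = []
    for line in lines:
        if not line.startswith("CC"):
            continue
        body = line[2:].strip()
        if body.startswith("-!-"):
            blocks.append(body[3:].strip())   # first line of a new block
        elif blocks:
            blocks[-1] += " " + body          # subsequent line of the block
    result = []
    for text in blocks:
        topic, separator, rest = text.partition(":")
        if separator and topic in TOPICS:
            result.append((topic, rest.strip()))
        else:
            result.append((None, text))
    return result


comments = [
    "CC   -!- SUBUNIT: HETERODIMER OF A LIGHT CHAIN AND A HEAVY CHAIN LINKED",
    "CC       BY A DISULFIDE BOND.",
    "CC   -!- FUNCTION: PROFILIN PREVENTS THE POLYMERIZATION OF ACTIN.",
]
for topic, text in parse_cc_blocks(comments):
    print(topic, "|", text)
```

Joining the lines of a block with a single blank is a simplification;
it does not attempt to rejoin words that were split at a hyphen.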
3.16 The // line
The //  (terminator) line  contains no  data or comments. It designates
the end of an entry.
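Since every entry ends with a terminator line, a file of entries can be
split into individual entries by watching for that line. The Python
sketch below illustrates this; the function name split_entries is
hypothetical, and the entry lines in the usage example are placeholders
rather than real data.

```python
def split_entries(lines):
    """Yield each entry as a list of lines, without the // terminator."""
    entry = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("//"):
            yield entry
            entry = []
        else:
            entry.append(line)
    # Lines after the last terminator belong to an incomplete entry
    # and are deliberately not yielded in this sketch.


flat_file = [
    "ID   FIRST_ENTRY (placeholder)",
    "SQ   (placeholder)",
    "//",
    "ID   SECOND_ENTRY (placeholder)",
    "//",
]
for number, entry in enumerate(split_entries(flat_file), start=1):
    print(number, len(entry))
```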
                 APPENDIX A: FEATURE TABLE KEYS
The definition of each of the key names used in the feature table is
explained here. It is probable that new key names will progressively
be added to this list. For each key a number of examples are presented.
A.1 Change indicators
CONFLICT - Different papers report differing sequences.
Examples of CONFLICT key feature lines:
FT   CONFLICT     33     33       MISSING (IN REF. 2).
FT   CONFLICT     60     60       P -> A (IN REF. 3 AND 4).
FT   CONFLICT     81     84       ASTQ -> GWT (IN REF. 3).
VARIANT - Authors report that sequence variants exist.
Examples of VARIANT key feature lines:
FT   VARIANT       3      3       V -> I.
FT   VARIANT      87     87       L -> T (IN STRAIN 2.3.1).
FT   VARIANT       1      2       MISSING (IN 25% OF THE CHAINS).
VARSPLIC -  Description of  sequence variants  produced by  alternative
splicing.
Examples of VARSPLIC key feature lines:
FT   VARSPLIC    194    196       GRP -> DVR (IN SHORT FORM).
FT   VARSPLIC    197    211       MISSING (IN SHORT FORM).
MUTAGEN - Site which has been experimentally altered.
Examples of MUTAGEN key feature lines:
FT   MUTAGEN      65     65       H->F: 100% LOSS OF ACTIVITY.
FT   MUTAGEN     123    123       G->R,L,M: DNA BINDING LOST.
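The descriptions of the CONFLICT, VARIANT and VARSPLIC keys shown above
are regular enough to be applied to a sequence by program. The Python
sketch below assumes that the description begins either with `MISSING'
or with `OLD -> NEW' written in one-letter codes, and that the endpoints
are plain residue numbers. It does not handle the MUTAGEN form, which is
written differently. The function name apply_change and the short
sequences in the usage example are hypothetical.

```python
import re

# Matches descriptions such as "P -> A (IN REF. 3 AND 4)." or "V -> I."
CHANGE = re.compile(r"^([A-Z]+) -> ([A-Z]+)\b")


def apply_change(sequence, start, end, description):
    """Return the alternative sequence described by one feature line.

    start and end are the 1-based, inclusive `FROM' and `TO' positions.
    """
    if description.startswith("MISSING"):
        replacement = ""
    else:
        match = CHANGE.match(description)
        if match is None:
            raise ValueError("unrecognised description: %r" % description)
        original, replacement = match.groups()
        # The residues named in the description must agree with the sequence.
        if sequence[start - 1:end] != original:
            raise ValueError("listed residues do not match the sequence")
    return sequence[:start - 1] + replacement + sequence[end:]


print(apply_change("MKVLAST", 3, 3, "V -> I."))
print(apply_change("GGASTQGG", 3, 6, "ASTQ -> GWT (IN REF. 3)."))
print(apply_change("MKVLAST", 1, 2, "MISSING (IN 25% OF THE CHAINS)."))
```

When several changes are applied to the same sequence, applying them
from the highest position downwards keeps the remaining positions valid.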
A.2 Amino acid modifications
MOD_RES  - Post-translational modification of a residue.
The chemical  nature of  the modification  is given in the description.
The general format of the MOD_RES description field is:
FT   MOD_RES     xxx    xxx       MODIFICATION (COMMENT).
The most frequently occurring modifications are the following:
ACETYLATION                    N-terminal or other.
AMIDATION                      Generally at  the C-terminal of a mature
                                   active peptide.
BLOCKED                        Undetermined N-  or C-terminal  blocking
                                   group.
FORMYLATION                    Of the N-terminal methionine.
GAMMA-CARBOXYGLUTAMIC ACID
HYDROXYLATION                  Of asparagine, aspartic acid, proline or
                                   lysine.
METHYLATION                    Generally of lysine or arginine.
PHOSPHORYLATION                Of serine, threonine, tyrosine, aspartic
                                   acid or histidine.
PYRROLIDONE CARBOXYLIC ACID    N-terminal glutamate which has formed an
                                   internal cyclic lactam.
SULFATATION                    Generally of tyrosine.
Examples of MOD_RES key feature lines:
FT   MOD_RES       1      1       ACETYLATION.
FT   MOD_RES      11     11       PHOSPHORYLATION (BY PKC).
FT   MOD_RES       2      2       SULFATATION (BY SIMILARITY).
FT   MOD_RES       8      8       AMIDATION (G-9 PROVIDES AMIDE GROUP).
FT   MOD_RES       9      9       METHYLATION (MONO-, DI- & TRI-).
LIPID  - Covalent binding of a lipid moiety.
The chemical  nature  of  the  bound  lipid  moiety  is  given  in  the
description. The general format of the LIPID description field is:
FT   LIPID       xxx    xxx       MODIFICATION (COMMENT).
The modifications which are currently defined are the following:
MYRISTATE          Myristate group  attached through  an amide  bond to
                   the N-terminal glycine residue of the mature form of
                   a protein [1,2] or to an internal lysine residue.
PALMITATE          Palmitate group attached through a thioester bond to
                   a cysteine  residue or  through an  ester bond  to a
                   serine or threonine residue [1,2].
FARNESYL           Farnesyl group  attached through a thioether bond to
                   a cysteine residue [3,4].
GERANYL-GERANYL    Geranyl-geranyl group  attached through  a thioether
                   bond to a cysteine residue [3,4].
GPI-ANCHOR         Glycosyl-phosphatidylinositol (GPI)  group linked to
                   the alpha-carboxyl  group of  the C-terminal residue
                   of the mature form of a protein [5,6].
N-ACYL DIGLYCERIDE N-terminal  cysteine   of  the   mature  form  of  a
                   prokaryotic lipoprotein  with an  amide-linked fatty
                   acid and  a glyceryl  group to which two fatty acids
                   are linked by ester linkages [7].
[1] Grand R.J.A.
    Biochem. J. 258:626-638(1989).
[2] McIlhinney R.A.J.
    Trends Biochem. Sci. 15:387-391(1990).
[3] Glomset J.A., Gelb M.H., Farnsworth C.C.
    Trends Biochem. Sci. 15:139-142(1990).
[4] Sinensky M., Lutz R.J.
    BioEssays 14:25-31(1992).
[5] Low M.G.
    FASEB J. 3:1600-1608(1989).
[6] Low M.G.
    Biochim. Biophys. Acta 988:427-454(1989).
[7] Hayashi S., Wu H.C.
    J. Bioenerg. Biomembr. 22:451-471(1990).
Examples of LIPID key feature lines:
FT   LIPID         1      1       MYRISTATE.
FT   LIPID        65     65       PALMITATE (BY SIMILARITY).
FT   LIPID       354    354       GPI-ANCHOR.
DISULFID - Disulfide bond.
The `FROM'  and `TO'  endpoints represent  the two  residues which  are
linked by  an intra-chain  disulfide  bond.  If  the  `FROM'  and  `TO'
endpoints are  identical, the  disulfide bond  is an interchain one and
the description  field indicates the nature of the cross-link. Examples
of DISULFID key feature lines:
FT   DISULFID     27     44       PROBABLE.
FT   DISULFID     14     14       INTERCHAIN (WITH A LIGHT CHAIN).
THIOLEST - Thiolester bond.
The `FROM'  and `TO'  endpoints represent  the two  residues which  are
linked by the thiolester bond.
THIOETH - Thioether bond.
The `FROM'  and `TO'  endpoints represent  the two  residues which  are
linked by the thioether bond.
CARBOHYD - Glycosylation site.
The nature  of the  carbohydrate (if known) is given in the description
field. Examples of CARBOHYD key feature lines:
FT   CARBOHYD    103    103       GLUCOSYLGALACTOSE.
FT   CARBOHYD    256    256       POTENTIAL.
METAL - Binding site for a metal ion.
The description  field indicates  the nature  of the metal. Examples of
METAL key feature lines:
FT   METAL        18     18       IRON (HEME AXIAL LIGAND).
FT   METAL        87     87       COPPER (POTENTIAL).
BINDING -  Binding site  for any  chemical group (co-enzyme, prosthetic
group, etc.).
The chemical  nature of  the group  is given  in the description field.
Examples of BINDING key feature lines:
FT   BINDING      14     14       HEME (COVALENT).
FT   BINDING     250    250       PYRIDOXAL PHOSPHATE.
A.3 Regions
SIGNAL - Extent of a signal sequence (prepeptide).
TRANSIT - Extent of a transit peptide (mitochondrial, chloroplastic, or
for a microbody).
Examples of TRANSIT key feature lines:
FT   TRANSIT       1     42       CHLOROPLAST.
FT   TRANSIT       1     25       MITOCHONDRION.
FT   TRANSIT       1     23       MICROBODY (POTENTIAL).
PROPEP - Extent of a propeptide.
Examples of PROPEP key feature lines:
FT   PROPEP       27     28       ACTIVATION PEPTIDE.
FT   PROPEP      550    574       REMOVED IN MATURE FORM.
CHAIN - Extent of a polypeptide chain in the mature protein.
Examples of CHAIN key feature lines:
FT   CHAIN        21    119       BETA-2 MICROGLOBULIN.
FT   CHAIN        37    >42       FACTOR XIIIA.
PEPTIDE - Extent of a released active peptide.
Examples of PEPTIDE key feature lines:
FT   PEPTIDE      13    107       NEUROPHYSIN 2.
FT   PEPTIDE     235    239       MET-ENKEPHALIN.
DOMAIN - Extent of a domain of interest on the sequence.
The nature  of that  domain is given in the description field. Examples
of DOMAIN key feature lines:
FT   DOMAIN       22    788       EXTRACELLULAR (POTENTIAL).
FT   DOMAIN      140    152       ANCESTRAL CALCIUM SITE.
CA_BIND - Extent of a calcium-binding region.
DNA_BIND - Extent of a DNA-binding region.
NP_BIND - Extent of a nucleotide phosphate binding region.
The nature  of the nucleotide phosphate is indicated in the description
field. Examples of NP_BIND key feature lines:
FT   NP_BIND      13     25       ATP.
FT   NP_BIND      45     49       GTP (POTENTIAL).
FT   NP_BIND       8     34       FAD (ADP PART).
TRANSMEM - Extent of a transmembrane region.
ZN_FING - Extent of a zinc finger region.
Examples of ZN_FING key feature lines:
FT   ZN_FING     110    134       GATA-TYPE.
FT   ZN_FING     559    579       C4-TYPE.
SIMILAR - Extent of a similarity with another protein sequence.
Precise  information,  relative  to  that  sequence  is  given  in  the
description field. Examples of SIMILAR key feature lines:
FT   SIMILAR     351    456       STRONG, WITH KAPPA CHAIN V REGIONS.
FT   SIMILAR     580   1182       HIGH, WITH ERBB TRANSFORMING PROTEIN.
REPEAT - Extent of an internal sequence repetition.
Examples of REPEAT key feature lines:
FT   REPEAT       75    300       APPROXIMATE.
FT   REPEAT      390    600       APPROXIMATE.
A.4 Secondary structure
The feature  table of  sequence  entries  of  proteins  whose  tertiary
structure is  known experimentally  contains  the  secondary  structure
information corresponding  to that  protein.  The  secondary  structure
assignment is  made according  to  DSSP  (see  Kabsch  W.,  Sander  C.;
Biopolymers, 22:2577-2637(1983))  and the information is extracted from
the coordinate data sets of the Protein Data Bank (PDB).
In the feature table only three types of secondary structure are
specified: helices (key HELIX), beta-strands (key STRAND) and turns
(key TURN). Residues not specified in one of these classes are in a
`loop' or `random-coil' structure. Because the DSSP assignment has
more than the three common secondary structure classes, we have
converted the following DSSP assignments to HELIX, STRAND, and TURN:
 
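   +------+------------------------------------------------+---------------+
   | DSSP | DSSP definition                                | SWISS-PROT    |
   | code |                                                | assignment    |
   +------+------------------------------------------------+---------------+
   | H    | Alpha-helix                                    | HELIX         |
   | G    | 3(10) helix                                    | HELIX         |
   | I    | Pi-helix                                       | HELIX         |
   | E    | Hydrogen bonded beta-strand (extended strand)  | STRAND        |
   | B    | Residue in an isolated beta-bridge             | STRAND        |
   | T    | H-bonded turn (3-turn, 4-turn or 5-turn)       | TURN          |
   | S    | Bend (five-residue bend centered at residue i) | Not specified |
   +------+------------------------------------------------+---------------+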
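The conversion from DSSP codes to the three feature keys can be
expressed as a lookup table. The Python sketch below assumes that the
DSSP assignment is available as a string with one character per residue,
and that any character other than H, G, I, E, B or T (including S)
yields no feature. Adjacent residues with the same key are merged into
one segment, so an alpha-helix directly followed by a 3(10) helix is
reported as a single HELIX; whether SWISS-PROT merges such segments is
an assumption of this sketch. The names DSSP_TO_KEY and dssp_to_features
are hypothetical.

```python
DSSP_TO_KEY = {
    "H": "HELIX", "G": "HELIX", "I": "HELIX",
    "E": "STRAND", "B": "STRAND",
    "T": "TURN",
}


def dssp_to_features(dssp_string):
    """Convert a per-residue DSSP string into (key, from, to) tuples."""
    features = []
    current_key, start = None, None
    for position, code in enumerate(dssp_string, start=1):
        key = DSSP_TO_KEY.get(code)   # None for S and all unlisted codes
        if key != current_key:
            if current_key is not None:
                features.append((current_key, start, position - 1))
            current_key, start = key, position
    if current_key is not None:
        features.append((current_key, start, len(dssp_string)))
    return features


# Hypothetical DSSP string; '-' stands for a residue without assignment.
print(dssp_to_features("--HHHHGG-TT-E"))
```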
One should be aware of the following facts:
-  Segment length. For helices (alpha and 3-10), the residue just
   before and just after the helix as given by DSSP participates in the
   helical hydrogen bonding pattern with a single H-bond. For some
   practical purposes, one can therefore extend the HELIX range by one
   residue on each side. E.g. HELIX 25-35 instead of HELIX 26-34. Also,
   the ends of secondary structure segments are less well defined for
   lower resolution structures. A fluctuation of +/- one residue is
   common.
-  Missing segments. In low resolution structures, badly formed helices
   or strands may be omitted in the DSSP definition.
-  Special helices and strands. Helices of length three are 3-10
   helices, those of length four and longer are either alpha-helices or
   3-10 helices (pi helices are extremely rare). A strand of length one
   corresponds to a residue in an isolated beta-bridge. Such bridges
   can be structurally important.
-  Missing secondary structure. No secondary structure is currently
   given in the feature table in the following cases:
   -  No sequence data in the PDB entry.
   -  Structure for which only C-alpha coordinates are in PDB.
   -  NMR structure with more than one coordinate data set.
   -  Model (i.e. theoretical) structure.
Examples:
FT   HELIX         3     14
FT   TURN         15     15
FT   TURN         20     21
FT   STRAND       23     23
FT   HELIX        25     35
A.5 Others
ACT_SITE - Amino acid(s) involved in the activity of an enzyme.
Examples of ACT_SITE key feature lines:
FT   ACT_SITE    193    193       ACCEPTS A PROTON DURING CATALYSIS.
FT   ACT_SITE     99     99       CHARGE RELAY SYSTEM.
SITE - Any other interesting site on the sequence.
Examples of SITE key feature lines:
FT   SITE        285    288       PREVENT SECRETION FROM ER.
FT   SITE        241    242       CLEAVAGE (BY ANIMAL COLLAGENASES).
INIT_MET - The sequence is known to start with an initiator methionine.
This feature  key is  mostly associated with a zero value in the `FROM'
and `TO' fields.
FT   INIT_MET      0      0
NON_TER -  The residue  at an  extremity of  the sequence  is  not  the
terminal residue.
If applied to position 1, this signifies that the first position is not
the N-terminus  of the  complete  molecule.  If  applied  to  the  last
position, it signifies  that this position is not the C-terminus of the
complete molecule. There is no description field for this key. Examples
of NON_TER key feature lines:
FT   NON_TER       1      1
FT   NON_TER     150    150
NON_CONS - Non-consecutive residues.
Indicates that  two residues in a sequence are not consecutive and that
there are  a number  of unsequenced  residues between them. Examples of
NON_CONS key feature lines:
FT   NON_CONS   1036   1037
FT   NON_CONS     33     34       N-TERMINAL / C-TERMINAL.
UNSURE - Uncertainties in the sequence.
Used to  describe region(s)  of a  sequence for  which the  authors are
unsure about the sequence assignment.