2. CONVENTIONS USED IN THE DATA BANK



The following sections describe the general conventions used in
SWISS-PROT to achieve uniformity of presentation. Experienced users of the
EMBL Database can skip these sections and directly refer to Appendix C,
which lists the minor differences in format between the two data
collections.

2.1 General structure of the data bank

The SWISS-PROT protein sequence data bank is composed of sequence
entries. Each entry corresponds to a single contiguous sequence as
contributed to the bank or reported in the literature. In some cases,
entries have been assembled from several papers that report overlapping
sequence regions. Conversely, a single paper can provide data for
several entries, e.g. when related sequences from different organisms
are reported.

References to positions within a sequence are made using sequential
numbering, beginning with 1 at the N-terminal end of the sequence.

Except for initiator N-terminal methionine residues, which are not
included in a sequence when their absence from the mature sequence has
been proven, the sequence data correspond to the precursor form of a
protein before post-translational modifications and processing.



2.2 Classes of data

To make data available to users as quickly as
possible after publication, SWISS-PROT entries may be released before
all their details are finalized. The concept of data classes gives the
user some idea of the areas in which the data still require further
work. The class of each entry is indicated on the first (ID) line of
the entry. At present two classes are supported:

STANDARD : Data which are complete to the standards laid down by
the SWISS-PROT data bank.

PRELIMINARY: Data for which only the sequence and bibliographic
information have been submitted to thorough checks.


2.3 Structure of a sequence entry

The entries in the SWISS-PROT data bank are structured so as to be
usable by human readers as well as by computer programs. The
explanations, descriptions, classifications and other comments are in
ordinary English. Wherever possible, symbols familiar to biochemists,
protein chemists and molecular biologists are used.

Each sequence entry is composed of lines. Different types of lines,
each with their own format, are used to record the various data which
make up the entry. A sample sequence entry is shown below.

ID TNFA_HUMAN STANDARD; PRT; 233 AA.
AC P01375;
DT 21-JUL-1986 (REL. 01, CREATED)
DT 21-JUL-1986 (REL. 01, LAST SEQUENCE UPDATE)
DT 01-FEB-1995 (REL. 31, LAST ANNOTATION UPDATE)
DE TUMOR NECROSIS FACTOR PRECURSOR (TNF-ALPHA) (CACHECTIN).
GN TNFA.
OS HOMO SAPIENS (HUMAN).
OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC EUTHERIA; PRIMATES.
RN [1]
RP SEQUENCE FROM N.A.
RX MEDLINE; 87217060.
RA NEDOSPASOV S.A., SHAKHOV A.N., TURETSKAYA R.L., METT V.A.,
RA AZIZOV M.M., GEORGIEV G.P., KOROBKO V.G., DOBRYNIN V.N.,
RA FILIPPOV S.A., BYSTROV N.S., BOLDYREVA E.F., CHUVPILO S.A.,
RA CHUMAKOV A.M., SHINGAROVA L.N., OVCHINNIKOV Y.A.;
RL COLD SPRING HARB. SYMP. QUANT. BIOL. 51:611-624(1986).
RN [2]
RP SEQUENCE FROM N.A.
RX MEDLINE; 85086244.
RA PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,
RA PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;
RL NATURE 312:724-729(1984).
RN [3]
RP SEQUENCE FROM N.A.
RX MEDLINE; 85137898.
RA SHIRAI T., YAMAGUCHI H., ITO H., TODD C.W., WALLACE R.B.;
RL NATURE 313:803-806(1985).
RN [4]
RP SEQUENCE FROM N.A.
RX MEDLINE; 86016093.
RA NEDWIN G.E., NAYLOR S.L., SAKAGUCHI A.Y., SMITH D.H.,
RA JARRETT-NEDWIN J., PENNICA D., GOEDDEL D.V., GRAY P.W.;
RL NUCLEIC ACIDS RES. 13:6361-6373(1985).
RN [5]
RP SEQUENCE FROM N.A.
RX MEDLINE; 85142190.
RA WANG A.M., CREASEY A.A., LADNER M.B., LIN L.S., STRICKLER J.,
RA VAN ARSDELL J.N., YAMAMOTO R., MARK D.F.;
RL SCIENCE 228:149-154(1985).
RN [6]
RP X-RAY CRYSTALLOGRAPHY (2.6 ANGSTROMS).
RX MEDLINE; 90008932.
RA ECK M.J., SPRANG S.R.;
RL J. BIOL. CHEM. 264:17595-17605(1989).
RN [7]
RP X-RAY CRYSTALLOGRAPHY (2.9 ANGSTROMS).
RX MEDLINE; 91193276.
RA JONES E.Y., STUART D.I., WALKER N.P.;
RL J. CELL SCI. SUPPL. 13:11-18(1990).
RN [8]
RP X-RAY CRYSTALLOGRAPHY (2.6 ANGSTROMS).
RX MEDLINE; 90008932.
RA ECK M.J., SPRANG S.R.;
RL J. BIOL. CHEM. 264:17595-17605(1989).
RN [9]
RP MUTAGENESIS.
RX MEDLINE; 91184128.
RA OSTADE X.V., TAVERNIER J., PRANGE T., FIERS W.;
RL EMBO J. 10:827-836(1991).
RN [10]
RP MYRISTOYLATION.
RX MEDLINE; 93018820.
RA STEVENSON F.T., BURSTEN S.L., LOCKSLEY R.M., LOVETT D.H.;
RL J. EXP. MED. 176:1053-1062(1992).
CC -!- FUNCTION: CYTOKINE WITH A WIDE VARIETY OF FUNCTIONS: IT CAN
CC CAUSE CYTOLYSIS OF CERTAIN TUMOR CELL LINES, IT IS IMPLICATED
CC IN THE INDUCTION OF CACHEXIA, IT IS A POTENT PYROGEN CAUSING
CC FEVER BY DIRECT ACTION OR BY STIMULATION OF IL-1 SECRETION, IT
CC CAN STIMULATE CELL PROLIFERATION & INDUCE CELL DIFFERENTIATION
CC UNDER CERTAIN CONDITIONS.
CC -!- SUBUNIT: HOMOTRIMER.
CC -!- SUBCELLULAR LOCATION: TYPE II MEMBRANE PROTEIN. ALSO EXISTS AS
CC AN EXTRACELLULAR SOLUBLE FORM.
CC -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC PROTEOLYTIC PROCESSING.
CC -!- DISEASE: CACHEXIA ACCOMPANIES A VARIETY OF DISEASES, INCLUDING
CC CANCER AND INFECTION, AND IS CHARACTERIZED BY GENERAL ILL
CC HEALTH AND MALNUTRITION.
CC -!- SIMILARITY: BELONGS TO THE TUMOR NECROSIS FACTOR FAMILY.
DR EMBL; X02910; HSTNFA.
DR EMBL; M16441; HSTNFAB.
DR EMBL; X01394; HSTNFR.
DR EMBL; M10988; HSTNFAA.
DR PIR; B23784; QWHUN.
DR PIR; A44189; A44189.
DR PDB; 1TNF; 15-JAN-91.
DR PDB; 2TUN; 31-JAN-94.
DR MIM; 191160; 11TH EDITION.
DR PROSITE; PS00251; TNF.
KW CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;
KW MYRISTYLATION; 3D-STRUCTURE.
FT PROPEP 1 76
FT CHAIN 77 233 TUMOR NECROSIS FACTOR.
FT TRANSMEM 36 56 SIGNAL-ANCHOR (TYPE-II PROTEIN).
FT LIPID 19 19 MYRISTATE.
FT LIPID 20 20 MYRISTATE.
FT DISULFID 145 177
FT MUTAGEN 108 108 R->W: BIOLOGICALLY INACTIVE.
FT MUTAGEN 112 112 L->F: BIOLOGICALLY INACTIVE.
FT MUTAGEN 162 162 S->F: BIOLOGICALLY INACTIVE.
FT MUTAGEN 167 167 V->A,D: BIOLOGICALLY INACTIVE.
FT MUTAGEN 222 222 E->K: BIOLOGICALLY INACTIVE.
FT CONFLICT 63 63 F -> S (IN REF. 5).
FT STRAND 89 93
FT TURN 99 100
FT TURN 109 110
FT STRAND 112 113
FT TURN 115 116
FT STRAND 118 119
FT STRAND 124 125
FT STRAND 130 143
FT STRAND 152 159
FT STRAND 166 170
FT STRAND 173 174
FT TURN 183 184
FT STRAND 189 202
FT TURN 204 205
FT STRAND 207 212
FT HELIX 215 217
FT STRAND 218 218
FT STRAND 227 232
SQ SEQUENCE 233 AA; 25644 MW; 279986 CN;
MSTESMIRDV ELAEEALPKK TGGPQGSRRC LFLSLFSFLI VAGATTLFCL LHFGVIGPQR
EEFPRDLSLI SPLAQAVRSS SRTPSDKPVA HVVANPQAEG QLQWLNRRAN ALLANGVELR
DNQLVVPSEG LYLIYSQVLF KGQGCPSTHV LLTHTISRIA VSYQTKVNLL SAIKSPCQRE
TPEGAEAKPW YEPIYLGGVF QLEKGDRLSA EINRPDYLDF AESGQVYFGI IAL
//


Each line begins with a two-character line code, which indicates the
type of data contained in the line. The current line types and line
codes, and the order in which they appear in an entry, are shown below:

ID - Identification.
AC - Accession number(s).
DT - Date.
DE - Description.
GN - Gene name(s).
OS - Organism species.
OG - Organelle.
OC - Organism classification.
RN - Reference number.
RP - Reference position.
RC - Reference comments.
RX - Reference cross-references.
RA - Reference authors.
RL - Reference location.
CC - Comments or notes.
DR - Database cross-references.
KW - Keywords.
FT - Feature table data.
SQ - Sequence header.
- (blanks) sequence data.
// - Termination line.


Some entries do not contain all of the line types, and some line types
occur many times in a single entry. Each entry must begin with an
identification line (ID) and end with a terminator line (//). In
addition the following line types are always present in an entry: AC
(once), DT (3 times), DE (1 or more), OS (1 or more), OC (1 or more),
RN (1 or more), RP (1 or more), RA (1 or more), RL (1 or more), SQ
(once), and at least one sequence data line. The other line types (GN,
OG, RC, RX, CC, DR, KW and FT) are optional.

A detailed description of each line type is given in the next section
of this document.

It must be noted that all SWISS-PROT line types exist in the EMBL
Database. A description of the format differences between the SWISS-
PROT and EMBL data banks is given in Appendix C of this document.

The two-character line type code which begins each line is always
followed by three blanks, so that the actual information begins with
the sixth character. Information is not extended beyond character
position 75.
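
The layout rule above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of the SWISS-PROT distribution; the function names `split_line` and `read_entry` are assumptions. It takes the first two characters as the line code and everything from the sixth character onward as the data, and it stops at the terminator line.

```python
def split_line(line):
    """Split one entry line into (line_code, data).

    Assumes the fixed layout described above: a two-character line
    code, three blanks, and the data starting at the sixth character.
    Sequence data lines carry a blank line code.
    """
    line = line.rstrip("\n")
    code = line[:2]
    data = line[5:]
    return code, data


def read_entry(lines):
    """Collect (line_code, data) pairs up to the terminator line (//)."""
    records = []
    for line in lines:
        code, data = split_line(line)
        if code == "//":
            break
        records.append((code, data))
    return records


# Two lines taken from the sample entry, followed by the terminator.
entry = [
    "ID   TNFA_HUMAN     STANDARD;      PRT;   233 AA.",
    "AC   P01375;",
    "//",
]
records = read_entry(entry)
```

Slicing by position rather than splitting on blanks keeps internal spacing in the data intact, which matters for fixed-column lines such as FT.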


3. THE DIFFERENT LINE TYPES


3.1 The ID line

The ID (IDentification) line is always the first line of an entry. The
general form of the ID line is:

ID ENTRY_NAME DATA_CLASS; MOLECULE_TYPE; SEQUENCE_LENGTH.

3.1.1 Entry Name

The first item on the ID line is the entry name of the sequence. This
name is a useful means of identifying a sequence. The entry name
consists of up to ten uppercase alphanumeric characters.

SWISS-PROT uses a general purpose naming convention which can be
symbolized as X_Y, where

X is a mnemonic code of at most 4 alphanumeric characters representing
the protein name. Examples: B2MG is for Beta-2-microglobulin, HBA is
for Hemoglobin alpha chain and INS is for Insulin.

The `_' sign serves as a separator.

Y is a mnemonic species identification code of at most 5 alphanumeric
characters representing the biological source of the protein. This
code is generally made of the first three letters of the genus and
the first two letters of the species. Examples: PSEPU is for
Pseudomonas putida and NAJNI is for Naja nivea.

However, for species commonly encountered in the data bank, self-
explanatory codes are used. There are 16 of those codes. They are:
BOVIN for Bovine, CHICK for Chicken, ECOLI for Escherichia coli,
HORSE for Horse, HUMAN for Human, MAIZE for Maize (Zea mays), MOUSE
for Mouse, PEA for Garden pea (Pisum sativum), PIG for Pig, RABIT
for Rabbit, RAT for Rat, SHEEP for Sheep, SOYBN for Soybean (Glycine
max), TOBAC for Common tobacco (Nicotiana tabacum), WHEAT for Wheat
(Triticum aestivum), YEAST for Baker's yeast (Saccharomyces
cerevisiae).

As it was not possible to apply the above rules to viruses, they
were given arbitrary, but generally easy to remember, identification
codes. In some cases it was not possible to assign a definitive code
to a species. In these cases a temporary code was chosen.

Examples of complete protein sequence entry names are: RL1_ECOLI for
ribosomal protein L1 from Escherichia coli, FER_HALHA for ferredoxin
from Halobacterium halobium.

All the presently defined species identification codes are
listed in the SWISS-PROT document file SPECLIST.TXT.


3.1.2 Data class

The second item on the ID line indicates the data class of the entry
(see section 2.2).

3.1.3 Molecule type

The third item on the ID line is a three letter code which indicates
the type of molecule of the entry: in SWISS-PROT it is PRT (for
PRoTein).

3.1.4 Length of the molecule

The fourth and last item of the ID line is the length of the molecule,
which is the total number of amino acids in the sequence. This number
includes the positions reported to be present but which have not been
determined (coded as `X'). The length is followed by the letter code AA
(Amino Acids).

3.1.5 Examples of identification lines

Two examples of ID lines are shown below:

ID CYC_BOVIN STANDARD; PRT; 104 AA.
ID GIA2_GIALA PRELIMINARY; PRT; 296 AA.
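
As a rough illustration, an ID line can be checked and taken apart with a regular expression. The sketch below is an assumption about how one might do this in Python, not an official parser; the names `ID_LINE` and `parse_id_line` are hypothetical. It encodes the rules given above: a protein code of at most 4 characters, a species code of at most 5, one of the two data classes, the molecule type PRT, and the length followed by AA.

```python
import re

# Pattern built from the rules in sections 3.1.1 to 3.1.4.
ID_LINE = re.compile(
    r"^ID\s+"
    r"(?P<protein>[A-Z0-9]{1,4})_(?P<species>[A-Z0-9]{1,5})\s+"
    r"(?P<data_class>STANDARD|PRELIMINARY);\s+"
    r"(?P<molecule_type>PRT);\s+"
    r"(?P<length>\d+)\s+AA\.$"
)


def parse_id_line(line):
    """Return the ID line fields as a dict, or raise ValueError."""
    match = ID_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid ID line: {line!r}")
    fields = match.groupdict()
    fields["length"] = int(fields["length"])
    fields["entry_name"] = fields["protein"] + "_" + fields["species"]
    return fields


parsed = parse_id_line("ID   GIA2_GIALA     PRELIMINARY;   PRT;   296 AA.")
```

Separating the protein and species codes makes it easy to group entries by organism, for example all entries whose species code is HUMAN.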


3.2 The AC line

The AC (ACcession number) line lists the accession numbers associated
with an entry. An example of an accession number line is shown below:

AC P00321; P05348;

The accession numbers are separated by semicolons and the list is
terminated by a semicolon. If necessary, more than one AC line will be
used. Most SWISS-PROT sequence entries currently have only one
accession number.

The purpose of accession numbers is to provide a stable way of
identifying entries from release to release. It is sometimes necessary
for reasons of consistency to change the names of the entries, for
example, to ensure that related entries have similar names. However, an
accession number is always conserved, and therefore allows unambiguous
citation of SWISS-PROT entries.

Researchers who wish to cite entries in their publications should
always cite the first accession number.
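
A minimal Python sketch of reading AC lines follows. It assumes the fixed layout of section 2.3 (data starting at the sixth character); the function name `parse_ac_lines` is hypothetical. The first accession number in the resulting list is the one to cite.

```python
def parse_ac_lines(lines):
    """Return all accession numbers from one or more AC lines, in order."""
    accessions = []
    for line in lines:
        data = line[5:]  # data starts at the sixth character
        for item in data.split(";"):
            item = item.strip()
            if item:
                accessions.append(item)
    return accessions


accessions = parse_ac_lines(["AC   P00321; P05348;"])
primary = accessions[0]  # the accession number to cite
```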

Entries will have more than one accession number if they have been
merged or split. For example, when two entries are merged into one, a
new accession number goes at the start of the AC line, and those from
the merged entries are listed after this one. Similarly, if an existing
entry is split into two or more entries (a rare occurrence), the
original accession number list is retained in all the derived entries.

An accession number is dropped only when the data to which it was
assigned have been completely removed from the data bank.



3.3 The DT line

The DT (DaTe) lines show the date of entry or last modification of the
sequence entry. The format of the DT lines is:

DT DD-MMM-YEAR (REL. XX, COMMENT)

where `DD' is the day, `MMM' the month, `YEAR' the year, and `XX' the
SWISS-PROT release number. The comment portion of the line indicates
the action taken on that date. There are ALWAYS three DT lines in each
entry, each associated with a specific comment:

- The first DT line indicates when the entry first appeared in the
data bank. The associated comment is `CREATED'.
- The second DT line indicates when the sequence data was last
modified. The associated comment is `LAST SEQUENCE UPDATE'.
- The third DT line indicates when any data other than the sequence
was last modified. The associated comment is `LAST ANNOTATION
UPDATE'.

Example of a block of DT lines:

DT 01-JAN-1988 (REL. 06, CREATED)
DT 01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)
DT 01-AUG-1992 (REL. 23, LAST ANNOTATION UPDATE)
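
The DT format can be read with a short regular expression, as in the following hedged Python sketch. The names `DT_LINE`, `MONTHS` and `parse_dt_line` are hypothetical, and the full set of month abbreviations is an assumption (three-letter English abbreviations); only some of them appear in the examples in this document. A lookup table is used instead of locale-dependent date parsing.

```python
import re
from datetime import date

# Assumed three-letter English month abbreviations.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
         "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"],
        start=1,
    )
}

DT_LINE = re.compile(
    r"^DT\s+(?P<day>\d{2})-(?P<month>[A-Z]{3})-(?P<year>\d{4})\s+"
    r"\(REL\. (?P<release>\d+), (?P<comment>[A-Z ]+)\)$"
)


def parse_dt_line(line):
    """Return (date, release_number, comment) for one DT line."""
    match = DT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a valid DT line: {line!r}")
    when = date(
        int(match.group("year")),
        MONTHS[match.group("month")],
        int(match.group("day")),
    )
    return when, int(match.group("release")), match.group("comment")


updated = parse_dt_line("DT   01-JUL-1989 (REL. 11, LAST SEQUENCE UPDATE)")
```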


3.4 The DE line

The DE (DEscription) lines contain general descriptive information
about the sequence stored. This information is generally sufficient to
identify the sequence precisely. The format of the DE lines is:

DE DESCRIPTION.

The description is given in ordinary English and is free-format. In
some cases, more than one DE line is required; in this case, the text
is divided only between words and only the last DE line is terminated
by a period.

When the complete sequence was not determined the last information
given on the DE lines will be `(FRAGMENT)' or `(FRAGMENTS)'.

Two examples of description lines are given here:

DE NADH DEHYDROGENASE (EC 1.6.99.3).

DE LYSOPINE DEHYDROGENASE (EC 1.5.1.16) (OCTOPINE SYNTHASE)
DE (LYSOPINE SYNTHASE) (FRAGMENT).



3.5 The GN line

The GN (Gene Name) line contains the name(s) of the gene(s) that encode
the stored protein sequence. The format of the GN line is:

GN NAME1[ AND|OR NAME2...].

Examples:

GN ALB.
GN REX-1.

It often occurs that more than one gene name has been assigned to an
individual locus. In that case all the synonyms will be listed. The
word `OR' separates the different designations. The first name in the
list is assumed to be the most correct (or most current) designation.
Example:

GN HNS OR DRDX OR OSMZ OR BGLY.

In a few cases, multiple genes encode an identical protein
sequence. In that case all the different gene names will be listed. The
word `AND' separates the designations. Example:

GN CECA1 AND CECA2.

In very rare cases `AND' and `OR' can both be present. In that case
parentheses are used as shown in the following example:

GN GVPA AND (GVPB OR GVPA2).
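
The `AND'/`OR' conventions above can be interpreted with a small routine such as the following Python sketch. It is an assumption-laden illustration, not an official parser: the function name `parse_gn_line` is hypothetical, and only the forms shown in this section are handled (synonyms joined by OR, distinct genes joined by AND, and a parenthesized OR group inside an AND list). The result is a list of genes, each given as a list of synonyms.

```python
def parse_gn_line(line):
    """Return a list of genes, each a list of synonymous names.

    Genes are separated by a top-level AND; synonyms by OR.
    """
    text = line[2:].strip()  # drop the GN line code
    if text.endswith("."):
        text = text[:-1]     # drop the terminating period

    groups, current, depth = [], [], 0
    for token in text.split():
        depth += token.count("(")
        if token == "AND" and depth == 0:
            groups.append(current)   # a top-level AND closes one gene
            current = []
        else:
            current.append(token)
        depth -= token.count(")")
    groups.append(current)

    genes = []
    for group in groups:
        names = " ".join(group).strip("()").split(" OR ")
        genes.append(names)
    return genes


genes = parse_gn_line("GN   GVPA AND (GVPB OR GVPA2).")
```

With this representation the first name of each inner list is the preferred designation of that gene.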



3.6 The KW line

The KW (KeyWord) lines provide information which can be used to
generate cross-reference indexes of the sequence entries based on
functional, structural, or other categories. The keywords chosen for
each entry serve as a subject reference for the sequence. Often several
KW lines are necessary for a single entry. The format of the KW lines
is:

KW KEYWORD[; KEYWORD...].

More than one keyword may be listed on each KW line; the keywords are
separated by semicolons, and the last keyword is followed by a period.
Keywords may consist of more than one word (they may contain blanks),
but are never split between lines. An example of a KW line is:

KW EYE LENS PROTEIN; ACETYLATION.

The order of the keywords is not significant. The above example could
also have been written:

KW ACETYLATION; EYE LENS PROTEIN.
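
A minimal sketch of collecting keywords from one or more KW lines is given below in Python. The function name `parse_kw_lines` is hypothetical, and the code assumes the fixed layout of section 2.3. Because keywords are never split between lines, the lines can simply be joined before splitting on semicolons.

```python
def parse_kw_lines(lines):
    """Return the keywords from all KW lines of an entry, in listed order."""
    text = " ".join(line[5:].strip() for line in lines)
    if text.endswith("."):
        text = text[:-1]  # the last keyword is followed by a period
    return [keyword.strip() for keyword in text.split(";") if keyword.strip()]


# KW lines of the sample entry in section 2.3.
keywords = parse_kw_lines([
    "KW   CYTOKINE; CYTOTOXIN; TRANSMEMBRANE; GLYCOPROTEIN; SIGNAL-ANCHOR;",
    "KW   MYRISTYLATION; 3D-STRUCTURE.",
])

# Order is not significant, so comparisons are best made on sets.
same = set(parse_kw_lines(["KW   EYE LENS PROTEIN; ACETYLATION."])) == set(
    parse_kw_lines(["KW   ACETYLATION; EYE LENS PROTEIN."])
)
```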



3.7 The OS line

The OS (Organism Species) line specifies the organism(s) which was the
source of the stored sequence. In the rare case where all the species
information will not fit on a single line more than one OS line is
used. The last OS line is terminated by a period.

The species designation consists, in most cases, of the Latin genus and
species designation followed by the English name (in parentheses). For
viruses, only the common English name is given. In cases where a
protein sequence is identical in more than one species the OS line(s)
will list the names of all those species.

Examples of OS lines are shown here:

OS ESCHERICHIA COLI.
OS HOMO SAPIENS (HUMAN).
OS ROUS SARCOMA VIRUS (STRAIN SCHMIDT-RUPPIN).
OS NAJA NAJA (INDIAN COBRA), AND NAJA NIVEA (CAPE COBRA).



3.8 The OG line

The OG (OrGanelle) lines indicate if the gene coding for a protein
originates from the mitochondria, the chloroplast, a cyanelle, or a
plasmid. The format of the OG line is:

OG CHLOROPLAST.
OG CYANELLE.
OG MITOCHONDRION.
OG PLASMID name.

Where `name' is the name of the plasmid.


3.9 The OC line

The OC (Organism Classification) lines contain the taxonomic
classification of the source organism. The classification is listed
top-down as nodes in a taxonomic tree in which the most general
grouping is given first. The classification may be distributed over
several OC lines, but nodes are not split or hyphenated between lines.
The individual items are separated by semicolons and the list is
terminated by a period. The format of the OC lines is:

OC NODE[; NODE...].

For example the classification lines for a human sequence would be:

OC EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC EUTHERIA; PRIMATES.


3.10 The reference (RN, RP, RC, RX, RA, RL) lines

These lines comprise the literature citations within SWISS-PROT. The
citations indicate the papers from which the data has been abstracted.
The reference lines for a given citation occur in a block, and are
always in the order RN, RP, RC, RX, RA, RL. Within each such reference
block the RN and RP lines occur once, the RC line occurs zero or more
times, the RX line occurs at most once, and the RA and RL lines each
occur one or more times. If several references are given, there will be
a reference block for each.

An example of a complete reference is:

RN [1]
RP SEQUENCE FROM N.A., AND SEQUENCE OF 1-15.
RC STRAIN=SPRAGUE-DAWLEY; TISSUE=LIVER;
RX MEDLINE; 91002678.
RA CHAN Y.-L., PAZ V., OLVERA J., WOOL I.G.;
RL BIOCHIM. BIOPHYS. ACTA 1050:69-73(1990).
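
Since every reference block starts with an RN line, the reference lines of an entry can be grouped as in the following Python sketch. The names `REFERENCE_CODES` and `group_references` are hypothetical, and the code assumes the fixed layout of section 2.3; it is an illustration rather than a definitive implementation.

```python
REFERENCE_CODES = ("RP", "RC", "RX", "RA", "RL")


def group_references(lines):
    """Group reference lines into one dict per reference block."""
    blocks = []
    for line in lines:
        code, data = line[:2], line[5:].rstrip()
        if code == "RN":
            # An RN line opens a new block; the number is in brackets.
            blocks.append({"number": int(data.strip("[]"))})
        elif code in REFERENCE_CODES and blocks:
            blocks[-1].setdefault(code, []).append(data)
    return blocks


# References 2 and 3 of the sample entry in section 2.3.
blocks = group_references([
    "RN   [2]",
    "RP   SEQUENCE FROM N.A.",
    "RX   MEDLINE; 85086244.",
    "RA   PENNICA D., NEDWIN G.E., HAYFLICK J.S., SEEBURG P.H., DERYNCK R.,",
    "RA   PALLADINO M.A., KOHR W.J., AGGARWAL B.B., GOEDDEL D.V.;",
    "RL   NATURE 312:724-729(1984).",
    "RN   [3]",
    "RP   SEQUENCE FROM N.A.",
    "RX   MEDLINE; 85137898.",
    "RA   SHIRAI T., YAMAGUCHI H., ITO H., TODD C.W., WALLACE R.B.;",
    "RL   NATURE 313:803-806(1985).",
])
```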

The formats of the individual lines are explained below.

3.10.1 The RN line

The RN (Reference Number) line gives a sequential number to each
reference citation in an entry. This number is used to indicate the
reference in comments and feature table notes. The format of the RN
line is:

RN [N]

where N denotes the nth reference for this entry. The reference number
is always enclosed in square brackets.

3.10.2 The RP line

The RP (Reference Position) line describes the extent of the work
carried out by the authors of the reference cited. The format of the RP
line is:

RP COMMENT.

Typical examples of RP lines are shown below:

RP SEQUENCE FROM N.A.
RP SEQUENCE FROM N.A., AND SEQUENCE OF 12-35.
RP SEQUENCE OF 34-56; 67-73 AND 123-345, AND DISULFIDE BONDS.
RP REVISIONS TO 67-89.
RP STRUCTURE BY NMR.
RP X-RAY CRYSTALLOGRAPHY (1.8 ANGSTROMS).
RP CHARACTERIZATION.
RP MUTAGENESIS OF TYR-56.
RP REVIEW.
RP VARIANT ALA-58.
RP VARIANTS XLI LEU-341; ARG-372 AND TYR-446.



3.10.3 The RC line

The RC (Reference Comment) lines are optional lines which are used to
store comments relevant to the reference cited. The format of the RC
line is:

RC TOKEN1=TEXT; TOKEN2=TEXT; .....

Where the currently defined tokens are:

PLASMID
SPECIES
STRAIN
TISSUE
TRANSPOSON

The `SPECIES' token is only used when an entry describes a sequence
which is identical in more than one species; similarly the `PLASMID' is
only used if an entry describes a sequence identical in more than one
plasmid.
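
The token syntax described above can be read with a few lines of Python. This is a sketch under stated assumptions: the names `RC_TOKENS` and `parse_rc_lines` are hypothetical, the fixed layout of section 2.3 is assumed, and a token that occurred twice would simply overwrite the earlier value.

```python
RC_TOKENS = {"PLASMID", "SPECIES", "STRAIN", "TISSUE", "TRANSPOSON"}


def parse_rc_lines(lines):
    """Return a dict mapping each RC token to its text."""
    text = " ".join(line[5:].strip() for line in lines)
    comments = {}
    for item in text.split(";"):
        item = item.strip()
        if not item:
            continue
        token, _, value = item.partition("=")
        if token not in RC_TOKENS:
            raise ValueError(f"unknown RC token: {token!r}")
        comments[token] = value
    return comments


comments = parse_rc_lines(["RC   STRAIN=SPRAGUE-DAWLEY; TISSUE=LIVER;"])
```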

An example of an RC line is:

RC STRAIN=SPRAGUE-DAWLEY; TISSUE=LIVER;

3.10.4 The RX line

The RX (Reference cross-reference) line is an optional line which is
used to indicate the identifier assigned to a specific reference in a
bibliographic database. The format of the RX line is:

RX BIBLIOGRAPHIC_DATABASE_NAME; IDENTIFIER.

where the valid bibliographic database names and their associated
identifier are:

Name: MEDLINE
Database: Medline from the National Library of Medicine (NLM)
Identifier: Eight digit Medline Unique Identifier (UID)

Example of RX line:

RX MEDLINE; 91002678.

3.10.5 The RA line

The RA (Reference Author) lines list the authors of the paper (or other
work) cited. All of the authors are included, and are listed in the
order given in the paper. The names are listed surname first followed
by a blank followed by initial(s) with periods. The authors' names are
separated by commas and terminated by a semicolon. Author names are not
split between lines. An example of the use of RA lines is shown below:

RA YANOFSKY C., PLATT T., CRAWFORD I.P., NICHOLS B.P., CHRISTIE G.E.,
RA HOROWITZ H., VAN CLEEMPUT M., WU A.M.;

As many RA lines as necessary are included for each reference.
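
The author list can be split as in the following Python sketch. The function name `parse_ra_lines` is hypothetical, and the code assumes that each name consists of a surname (possibly several words) followed by one blank and the initials, as described above; names that carry a suffix after the initials would need extra handling.

```python
def parse_ra_lines(lines):
    """Return a list of (surname, initials) pairs from the RA lines."""
    text = " ".join(line[5:].strip() for line in lines)
    text = text.rstrip(";")  # the author list ends with a semicolon
    authors = []
    for name in text.split(", "):
        # The initials are the last blank-separated item of the name.
        surname, _, initials = name.strip().rpartition(" ")
        authors.append((surname, initials))
    return authors


authors = parse_ra_lines([
    "RA   YANOFSKY C., PLATT T., CRAWFORD I.P., NICHOLS B.P., CHRISTIE G.E.,",
    "RA   HOROWITZ H., VAN CLEEMPUT M., WU A.M.;",
])
```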

3.10.6 The RL line

The RL (Reference Location) lines contain the conventional citation
information for the reference. In general, the RL lines alone are
sufficient to find the paper in question.

a) Journal citations

The RL line for a journal citation includes the journal abbreviation,
the volume number, the page range, and the year. The format for such an
RL line is:

RL JOURNAL VOL:PP-PP(YEAR).

Journal names are abbreviated according to the conventions used by the
National Library of Medicine (NLM) and are based on the existing ISO
and ANSI standards. A list of the abbreviations currently in use is
given in the SWISS-PROT document file JOURLIST.TXT.


An example of an RL line is:

RL J. MOL. BIOL. 168:321-331(1983).

When a reference is made to a paper which is `in press' at the time
when the data bank is released, the page range, and possibly the
volume number, are indicated as `0' (zero). An example of an RL line of
this type is shown here:

RL NUCLEIC ACIDS RES. 22:0-0(1994).
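
A journal citation in this format can be taken apart with a regular expression, as in the hedged Python sketch below. The names `JOURNAL_RL` and `parse_journal_rl` are hypothetical, and the pattern assumes purely numeric volume and page numbers, as in the examples of this section. A page range of 0-0 is reported as an `in press' citation.

```python
import re

JOURNAL_RL = re.compile(
    r"^RL\s+(?P<journal>.+?) (?P<volume>\d+):"
    r"(?P<first_page>\d+)-(?P<last_page>\d+)\((?P<year>\d{4})\)\.$"
)


def parse_journal_rl(line):
    """Return the citation fields as a dict, or None if the line
    is not a journal citation in the format shown above."""
    match = JOURNAL_RL.match(line.strip())
    if match is None:
        return None
    citation = match.groupdict()
    for key in ("volume", "first_page", "last_page", "year"):
        citation[key] = int(citation[key])
    # Zero page numbers mark a paper that was in press at release time.
    citation["in_press"] = (
        citation["first_page"] == 0 and citation["last_page"] == 0
    )
    return citation


published = parse_journal_rl("RL   J. MOL. BIOL. 168:321-331(1983).")
pending = parse_journal_rl("RL   NUCLEIC ACIDS RES. 22:0-0(1994).")
```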

b) Book citations

A variation of the RL line format is used for papers found in books or
other similar publications, which are cited as shown below:

RL (IN) THE ENZYMES, 3RD ED., VOL.11, PART A, BOYER P.D., ED.,
RL PP.397-547, ACADEMIC PRESS, NEW YORK, (1975).

The first RL line contains the designation `(IN)', which indicates that
this is a book reference. These citations generally include the
following information: the title of the book, the name of the
editor(s), the page range, the publisher name, the city where it is
published, and the year of publication (which is always shown in
parentheses).

c) Unpublished results

RL lines for unpublished results follow the format shown in the
following example:

RL UNPUBLISHED RESULTS, CITED BY:
RL ULRICH E.L., KROGMANN D.W., MARKLEY J.L.;
RL J. BIOL. CHEM. 257:9356-9364(1982).

d) Unpublished observations

For unpublished observations the format of the RL line is:

RL UNPUBLISHED OBSERVATIONS (MMM-YEAR).

Where `MMM' is the month and `YEAR' is the year.

We use the `unpublished observations' RL line to cite communications by
scientists to SWISS-PROT of unpublished information concerning various
aspects of a sequence entry.

e) Thesis

For Ph.D. theses the format of the RL line is:

RL THESIS (YEAR), INSTITUTION_NAME, COUNTRY.


An example of such a line is given here:

RL THESIS (1972), GEORGE WASHINGTON UNIVERSITY, U.S.A.

f) Patent applications

For patent applications the format of the RL line is:

RL PATENT NUMBER PAT_NUMB, DD-MMM-YYYY.

Where `PAT_NUMB' is the international publication number of the patent,
`DD' is the day, `MMM' is the month and `YYYY' is the year.

g) Submissions

The final form that an RL line can take is that used for submissions.
The format of such an RL line is:

RL SUBMITTED (MMM-YEAR) TO DATABASE_NAME.

Where `MMM' is the month, `YEAR' is the year and `DATABASE_NAME' is one
of the following:

EMBL/GENBANK/DDBJ DATA BANKS
THE SWISS-PROT DATA BANK
THE ECOSEQ DATA BANK
THE HIV DATA BANK
THE MIM DATA BANK
THE NEWAT DATA BANK
THE PDB DATA BANK
THE PIR DATA BANK

Two examples of submission RL lines are given here:

RL SUBMITTED (APR-1994) TO EMBL/GENBANK/DDBJ DATA BANKS.
RL SUBMITTED (FEB-1995) TO THE SWISS-PROT DATA BANK.
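
The seven forms of RL line described in this section can be told apart by the text that opens the first RL line of a reference. The Python sketch below illustrates one way of doing so; the names `RL_PREFIXES` and `classify_rl` and the category labels are hypothetical. Any line that matches none of the fixed prefixes is assumed to be a journal citation.

```python
# Opening text of the first RL line and the corresponding category.
RL_PREFIXES = (
    ("(IN)", "book"),
    ("UNPUBLISHED RESULTS", "unpublished results"),
    ("UNPUBLISHED OBSERVATIONS", "unpublished observations"),
    ("THESIS", "thesis"),
    ("PATENT NUMBER", "patent"),
    ("SUBMITTED", "submission"),
)


def classify_rl(first_line):
    """Return the citation category for the first RL line of a reference."""
    data = first_line[5:].strip()
    for prefix, kind in RL_PREFIXES:
        if data.startswith(prefix):
            return kind
    return "journal"


kind = classify_rl("RL   SUBMITTED (APR-1994) TO EMBL/GENBANK/DDBJ DATA BANKS.")
```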


3.11 The DR line

3.11.1 Definition

The DR (Database cross-Reference) lines are used as pointers to
information related to SWISS-PROT entries and found in data collections
other than SWISS-PROT.

For example, if the X-ray crystallographic atomic coordinates of a
sequence are stored in the Brookhaven Protein Data Bank (PDB) there
will be DR line(s) pointing to the corresponding entries in that data
bank. For a sequence translated from a nucleotide sequence there can be
DR lines pointing to entries in the EMBL or GenBank data banks which
correspond to the DNA or RNA sequence(s) from which it was translated.

The format of the DR line is:

DR DATA_BANK_IDENTIFIER; PRIMARY_IDENTIFIER; SECONDARY_IDENTIFIER.


3.11.2 Data bank identifier

The first item on the DR line, the data bank identifier, is the
abbreviated name of the data collection to which reference is made. The
currently defined data bank identifiers are the following:

EMBL                 Nucleotide sequence database of EMBL (EBI)
DICTYDB              Dictyostelium discoideum genome database
ECO2DBASE            Escherichia coli gene-protein database (2D gel
                     spots) (ECO2DBASE)
ECOGENE              Escherichia coli K12 genome database (EcoGene)
FLYBASE              Drosophila genome database (FlyBase)
GCRDB                G-protein--coupled receptor database (GCRDb)
HIV                  HIV sequence database
HSSP                 Homology-derived secondary structure of proteins
                     database (HSSP)
LISTA                Yeast (Saccharomyces cerevisiae) genome
                     database (LISTA)
MAIZEDB              Maize genome database (MaizeDB)
MIM                  Mendelian Inheritance in Man Database (MIM)
PDB                  Brookhaven Protein Data Bank (PDB)
PIR                  Protein sequence database of the Protein
                     Information Resource (PIR)
PROSITE              PROSITE dictionary of sites and patterns in
                     proteins
REBASE               Restriction enzyme database (REBASE)
AARHUS/GHENT-2DPAGE  Human keratinocyte 2D gel protein database from
                     Aarhus and Ghent universities
SGD                  Saccharomyces Genome Database (SGD)
STYGENE              Salmonella typhimurium LT2 genome database
                     (StyGene)
SUBTILIST            Bacillus subtilis 168 genome database (SubtiList)
SWISS-2DPAGE         Human 2D Gel Protein Database from the University
                     of Geneva (SWISS-2DPAGE)
TRANSFAC             Transcription factor database (Transfac)
WORMPEP              Caenorhabditis elegans genome sequencing project
                     protein database (Wormpep)
YEPD                 Yeast electrophoresis protein database (YEPD)


3.11.3 The primary identifier

The second item on the DR line, the primary identifier, is an
unambiguous pointer to the information entry in the data bank to which
reference is being made.

- For an EMBL, DictyDb, EcoGene, FlyBase, GCRDb, HIV, LISTA, PIR,
PROSITE, SGD, StyGene, SubtiList, SWISS-2DPAGE or Transfac reference
the primary identifier is the first accession number (also called
the Unique Identifier in some data banks) of the entry to which
reference is being made.
- For a MIM reference the primary identifier is the catalog number of
the disease (or phenotype).
- For a PDB or REBASE reference the primary identifier is the entry
name.
- For an AARHUS/GHENT-2DPAGE, ECO2DBASE or YEPD reference the primary
identifier is the protein spot alphanumeric designation.
- For a WormPep reference the primary identifier is the cosmid-derived
name given to that protein by the C.elegans genome sequencing
project.
- For a MaizeDB reference the primary identifier is the "Gene-product"
accession ID.
- For a HSSP reference the primary identifier is the accession number
of a SWISS-PROT entry cross-referenced to a PDB entry whose
structure is expected to be similar to that of the entry in which
the HSSP cross-reference is present.


3.11.4 The secondary identifier

The third and last item on the DR line, the secondary identifier, is
used to complement the information given by the primary identifier.

- For an EMBL, GenBank, HIV, PIR or PROSITE reference the secondary
identifier is the entry's name.
- For a PDB reference the secondary identifier is the most recent date
on which PDB revised the entry (last `REVDAT' record).
- For a DictyDb, EcoGene, FlyBase, LISTA, SGD, StyGene or SubtiList
reference the secondary identifier is the gene designation. If the
gene designation is not available a dash "-" is used.
- For a MIM, REBASE, or ECO2DBASE reference the secondary identifier
is the latest release number or edition of the database that has
been used to derive the cross-reference.
- For a SWISS-2DPAGE reference the secondary identifier is the species
of origin.
- For an AARHUS/GHENT-2DPAGE reference the secondary identifier is
either `IEF' (for isoelectric focusing) or `NEPHGE' (for non-
equilibrium pH gradient electrophoresis).
- For a WormPep reference the secondary identifier is a number
attributed by the C.elegans genome sequencing project to that
protein.
- For a GCRDb, MaizeDB, Transfac or YEPD reference the secondary
identifier is not defined and a dash "-" is stored in that field.
- For a HSSP reference the secondary identifier is the entry name of
the PDB structure related to that of the entry in which the HSSP
cross-reference is present.


Examples of complete DR lines are shown here:

DR AARHUS/GHENT-2DPAGE; 8006; IEF.
DR DICTYDB; DD01047; MYOA.
DR EMBL; X01704; GMNOD23.
DR ECO2DBASE; G052.0; 6TH EDITION.
DR ECOGENE; EG10054; ARAC.
DR FLYBASE; FBGN0000055; ADH.
DR GCRDB; GCR_0087; -.
DR HIV; K02013; NEF$BRU.
DR HSSP; P00438; 1DOB.
DR LISTA; SC00018; ACT1.
DR MAIZEDB; 25342; -.
DR MIM; 249900; 11TH EDITION.
DR PDB; 3ADK; 16-APR-88.
DR PIR; A02768; R5EC7.
DR PROSITE; PS00021; KRINGLE.
DR REBASE; BSURI; RELEASE 9410.
DR SGD; L0000008; AAR2.
DR STYGENE; SG10312; PROV.
DR SUBTILIST; BG10774; OPPD.
DR SWISS-2DPAGE; P10599; HUMAN.
DR TRANSFAC; T00141; -.
DR WORMPEP; ZK637.7; CE00437.
DR YEPD; 4270; -.
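
A DR line can be split into its three items as in the following Python sketch, which is illustrative only; the function name `parse_dr_line` is hypothetical and the fixed layout of section 2.3 is assumed. Only the final period is removed, because identifiers such as G052.0 contain periods themselves.

```python
def parse_dr_line(line):
    """Return (data_bank, primary_identifier, secondary_identifier)."""
    data = line[5:].strip()
    if data.endswith("."):
        data = data[:-1]  # remove only the terminating period
    fields = [field.strip() for field in data.split(";")]
    if len(fields) != 3:
        raise ValueError(f"expected three items on DR line: {line!r}")
    return tuple(fields)


cross_reference = parse_dr_line("DR   ECO2DBASE; G052.0; 6TH EDITION.")
```

A secondary identifier of "-" indicates that the field is not defined or not available for that data bank.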


3.12 The FT line

The FT (Feature Table) lines provide a precise but simple means for the
annotation of the sequence data. The table describes regions or sites
of interest in the sequence. In general the feature table lists post-
translational modifications, binding sites, enzyme active sites, local
secondary structure or other characteristics reported in the cited
references. Sequence conflicts between references are also included in
the feature table. The feature table is updated when more becomes known
about a given sequence.

The FT lines have a fixed format. The column numbers allocated to each
of the data items within each FT line are shown in the following table
(column numbers not referred to in the table are always occupied by
blanks):

+---------------+-----------------------+
| Columns | Data item |
+---------------+-----------------------+
| 1- 2 | FT |
| 6-13 | Key name |
| 15-20 | `FROM' endpoint |
| 22-27 | `TO' endpoint |
| 35-75 | Description |
+---------------+-----------------------+

The key name and the endpoints are always on a single line, but the
description may require continuation. For this purpose, the next line
contains blanks in the key, the `FROM', and the `TO' column positions,
and the description is continued in its normal position. Thus a blank
key always denotes a continuation of the previous description.

An example of a feature table is shown below:

FT NON_TER 1 1
FT PEPTIDE 1 9 ARG-VASOPRESSIN.
FT PEPTIDE 13 107 NEUROPHYSIN 2.
FT PEPTIDE 109 147 COPEPTIN.
FT DISULFID 1 6
FT MOD_RES 9 9 AMIDATION (ACTIVE ARG-VASOPRESSIN).
FT CONFLICT 102 102 D -> S (IN REF. 2).
FT CONFLICT 105 105 MISSING (IN REF. 3).
FT CARBOHYD 114 114
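
The column layout and the continuation rule described above can be
sketched in Python as follows. This is only an illustration, not part of
the data bank software: the names Feature, parse_ft_lines and ft_line are
hypothetical, and right-justifying the endpoints inside their columns is
an assumption made for the demonstration helper.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    key: str
    start: str        # raw `FROM' field, kept as text
    end: str          # raw `TO' field, kept as text
    description: str


def parse_ft_lines(lines: List[str]) -> List[Feature]:
    """Parse FT lines using the fixed columns from the table above."""
    features: List[Feature] = []
    for raw in lines:
        if not raw.startswith("FT"):
            continue
        line = raw.rstrip("\n").ljust(75)
        key = line[5:13].strip()       # columns 6-13
        start = line[14:20].strip()    # columns 15-20
        end = line[21:27].strip()      # columns 22-27
        desc = line[34:75].strip()     # columns 35-75
        if key:
            features.append(Feature(key, start, end, desc))
        elif features:
            # A blank key denotes a continuation of the previous description.
            prev = features[-1]
            joined = (prev.description + " " + desc).strip()
            features[-1] = prev._replace(description=joined)
    return features


def ft_line(key: str, start: str, end: str, desc: str = "") -> str:
    """Build one FT line in the fixed layout (hypothetical demo helper)."""
    line = ("FT   " + key.ljust(8) + " " + start.rjust(6) + " "
            + end.rjust(6) + " " * 7 + desc)
    return line.rstrip()


demo = [
    ft_line("PEPTIDE", "1", "9", "ARG-VASOPRESSIN."),
    ft_line("MOD_RES", "9", "9", "AMIDATION (ACTIVE"),
    ft_line("", "", "", "ARG-VASOPRESSIN)."),
    ft_line("CARBOHYD", "114", "114"),
]
features = parse_ft_lines(demo)
```

The endpoints are deliberately kept as text here, because they are not
always plain residue numbers.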

The first item on each FT line is the key name, which is a fixed
abbreviation (up to 8 characters) with a defined meaning. A list of the
currently defined key names can be found in Appendix A of this
document.

Following the key name are the `FROM' and `TO' endpoint specifications.
These fields designate (inclusively) the endpoints of the feature named
in the key field. In general, these fields simply contain residue
numbers indicating positions in the sequence as listed. Note that these
positions are always specified assuming a numbering of the listed
sequence from 1 to n; this numbering is not necessarily the same as
that used in the original reference(s). The following should be noted
in interpreting these endpoints:

- If the `FROM' and `TO' specifications are equal, the feature
indicated consists of the single amino acid at that position.

- When a feature is known to extend beyond the end(s) of the sequenced
region, the endpoint specification will be preceded by < for
features which continue to the left end (N-terminal direction) or by
> for features which continue to the right end (C-terminal
direction).

- Unknown endpoints are denoted by `?'.

See also the notes concerning each of the key names in Appendix A.
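
The endpoint conventions listed above can be interpreted with a few lines
of Python. The sketch below is an illustration under stated assumptions:
the names Endpoint, parse_endpoint and feature_length are hypothetical,
and only the three notations described above (plain number, < or >
prefix, `?') are handled.

```python
from typing import NamedTuple, Optional


class Endpoint(NamedTuple):
    position: Optional[int]   # None when the endpoint is unknown (`?')
    open_ended: bool          # True when the field is prefixed by < or >


def parse_endpoint(field: str) -> Endpoint:
    """Interpret one `FROM' or `TO' field."""
    text = field.strip()
    if text == "?":
        return Endpoint(None, False)
    open_ended = text[:1] in ("<", ">")
    if open_ended:
        text = text[1:]
    return Endpoint(int(text), open_ended)


def feature_length(start_field: str, end_field: str) -> Optional[int]:
    """Number of listed residues covered by a feature (inclusive range).

    Returns None if either endpoint is unknown. For an open-ended
    feature the result is only a lower bound on the true length.
    """
    start = parse_endpoint(start_field)
    end = parse_endpoint(end_field)
    if start.position is None or end.position is None:
        return None
    return end.position - start.position + 1
```

For instance, feature_length("9", "9") gives 1, which corresponds to the
single amino acid case described in the first note.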

The remaining portion of the FT line is a description which contains
additional information about the feature. For example, for a residue
post-translational modification (key MOD_RES) the chemical nature of
that modification is given, while for a sequence variation (key
VARIANT) the nature of the variation is indicated. This portion of the
line is generally in free form, and may be continued on additional
lines when necessary.


3.13 The SQ line

The SQ (SeQuence header) line marks the beginning of the sequence data
and gives a quick summary of its content. The format of the SQ line is:

SQ SEQUENCE XXXX AA; XXXXX MW; XXXXX CN;

The line contains the length of the sequence in amino acids (AA),
followed by the molecular weight (MW) rounded to the nearest mass unit
(dalton) and a checking number (CN) as defined in the following reference:

Bairoch A.
Biochem. J. 203:527-528(1982).

An example of an SQ line is shown here:

SQ SEQUENCE 104 AA; 11530 MW; 54319 CN;

The information in the SQ line can be used as a check on accuracy or
for statistical purposes. The word `SEQUENCE' is present solely for
readability.
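
As an illustration of how the SQ line can serve as an accuracy check, the
following Python sketch extracts the three numbers with a regular
expression. The names SQ_PATTERN and parse_sq_line are hypothetical, and
the amount of white space between the fields is assumed to be variable.
The checking number algorithm itself is not reproduced here.

```python
import re

# Tolerant of variable spacing; the literal words follow the format above.
SQ_PATTERN = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;\s+(\d+)\s+CN;\s*$"
)


def parse_sq_line(line: str) -> dict:
    """Return the length, molecular weight and checking number."""
    match = SQ_PATTERN.match(line)
    if match is None:
        raise ValueError("not a valid SQ line: %r" % line)
    length, mol_weight, check_number = (int(g) for g in match.groups())
    return {
        "length": length,
        "mol_weight": mol_weight,
        "check_number": check_number,
    }


summary = parse_sq_line("SQ   SEQUENCE   104 AA;  11530 MW;  54319 CN;")
```

A program reading an entry can then compare summary["length"] with the
number of residues actually found in the sequence data lines.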


3.14 The sequence data line

The sequence data line has a line code consisting of two blanks rather
than the two-letter codes used up until now. The sequence is written 60
amino acids per line, in groups of 10 amino acids, beginning in
position 6 of the line.

The characters used for the amino acids are the standard IUPAC
one-letter codes (see Appendix B).

An example of sequence data lines is shown here:

GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA ANKNKGIIWG
EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ATNE
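
The layout described above (60 residues per line, groups of 10, starting
in position 6) can be produced and read back with the short Python sketch
below. The function names format_sequence and read_sequence are
hypothetical and serve only as an illustration.

```python
from typing import List


def format_sequence(seq: str, per_line: int = 60, group: int = 10) -> List[str]:
    """Write a sequence as data lines: blank line code, text from column 6."""
    lines = []
    for i in range(0, len(seq), per_line):
        chunk = seq[i:i + per_line]
        groups = [chunk[j:j + group] for j in range(0, len(chunk), group)]
        lines.append(" " * 5 + " ".join(groups))
    return lines


def read_sequence(lines: List[str]) -> str:
    """Join sequence data lines (line code of two blanks) into one string."""
    return "".join(
        "".join(line.split()) for line in lines if line.startswith("  ")
    )


# The example sequence shown above, with the group separators removed.
example = (
    "GDVEKGKKIF IMKCSQCHTV EKGGKHKTGP NLHGLFGRKT GQAPGYSYTA ANKNKGIIWG "
    "EDTLMEYLEN PKKYIPGTKM IFVGIKKKEE RADLIAYLKK ATNE"
).replace(" ", "")
data_lines = format_sequence(example)
```

The example sequence contains 104 residues, which agrees with the length
given in the SQ line example of section 3.13.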


3.15 The CC line

The CC lines are free text comments on the entry, and may be used to
convey any useful information. The comments always appear below the
last reference line and are grouped together in comment blocks, a block
being made of one or more comment lines. The first line of a block is
marked with the characters `-!-'.

The format of a comment block is:

CC -!- FIRST LINE OF A COMMENT BLOCK.
CC SECOND AND SUBSEQUENT LINES OF A COMMENT BLOCK.

A major proportion of the comment blocks are arranged according to what
we designate as `topics'. The format of a comment block which belongs
to a `topic' is:

CC -!- TOPIC: FREE TEXT DESCRIPTION.


The current topics and their definitions are:

ALTERNATIVE PRODUCTS Description of the existence of related
protein sequence(s) produced by alternative splicing
of the same gene(s) or by the use of alternative
initiation codons.
CATALYTIC ACTIVITY Description of the reaction(s) catalysed by an
enzyme [*].
CAUTION This topic warns you about possible errors and/or
grounds for confusion.
COFACTOR Description of an enzyme cofactor.
DEVELOPMENTAL STAGE Description of the developmentally specific
expression of a protein.
DISEASE Description of the disease(s) associated with a
deficiency of a protein.
DOMAIN Description of the domain structure of a protein.
ENZYME REGULATION Description of an enzyme regulatory mechanism.
FUNCTION General description of the function(s) of a protein.
INDUCTION Description of the compound(s) which stimulate the
synthesis of a protein.
PATHWAY Description of the metabolic pathway(s) with which a
protein is associated.
POLYMORPHISM Description of polymorphism(s).
PTM Description of a post-translational modification.
SIMILARITY Description of the similarity(ies) (sequence or
structural) of a protein with other proteins.
SUBCELLULAR LOCATION Description of the subcellular location of a
mature protein product.
SUBUNIT Description of the quaternary structure of a
protein.
TISSUE SPECIFICITY Description of the tissue specificity of a protein.

We show here, for each of the topics defined above, two examples of its
usage:

CC -!- ALTERNATIVE PRODUCTS: SKELETAL MUSCLE AND FIBROBLAST
CC TROPOMYOSINS ARE OBTAINED BY ALTERNATIVE MRNA SPLICING.

CC -!- ALTERNATIVE PRODUCTS: USING ALTERNATIVE INITIATION CODONS IN
CC THE SAME READING FRAME, THE GENE TRANSLATES INTO THREE
CC ISOZYMES: ALPHA, BETA AND BETA'.

CC -!- CATALYTIC ACTIVITY: ATP + L-GLUTAMATE + NH(3) = ADP +
CC GLUTAMINE + ORTHOPHOSPHATE.

CC -!- CATALYTIC ACTIVITY: (R)-2,3-DIHYDROXY-3-METHYLBUTANOATE +
CC NADP(+) = (S)-2-HYDROXY-2-METHYL-3-OXOBUTANOATE + NADPH.

CC -!- CAUTION: REF.2 SEQUENCE DIFFERS FROM THAT SHOWN IN POSITIONS
CC 92 TO 165 DUE TO A FRAMESHIFT.

CC -!- CAUTION: IT IS UNCERTAIN WHETHER MET-1 OR MET-3 IS THE
CC INITIATOR.

CC -!- COFACTOR: PYRIDOXAL PHOSPHATE.

CC -!- COFACTOR: FAD FLAVOPROTEIN AND NONHEME IRON.

CC -!- DEVELOPMENTAL STAGE: EXPRESSED EARLY DURING CONIDIAL (DORMANT
CC SPORES) DIFFERENTIATION.

CC -!- DEVELOPMENTAL STAGE: EXPRESSED IN EMBRYONIC AND EARLY LARVAL
CC STAGES.

CC -!- DISEASE: DEFECTS IN PHKA1 ARE LINKED TO X-LINKED MUSCLE
CC GLYCOGENOSIS, A DISEASE CHARACTERIZED BY SLOWLY PROGRESSIVE,
CC PREDOMINANTLY DISTAL MUSCLE WEAKNESS AND ATROPHY.

CC -!- DISEASE: DEFECTS IN ALD ARE THE CAUSE OF X-LINKED
CC ADRENOLEUKODYSTROPHY, A PEROXISOMAL DISORDER CHARACTERIZED BY
CC PROGRESSIVE DEMYELINATION OF THE CNS AND ADRENAL
CC INSUFFICIENCY.

CC -!- DOMAIN: CONTAINS A COILED-COIL DOMAIN ESSENTIAL FOR VESICULAR
CC TRANSPORT AND A DISPENSABLE C-TERMINAL REGION.

CC -!- DOMAIN: THE B CHAIN IS COMPOSED OF TWO DOMAINS, EACH DOMAIN
CC CONSISTS OF 3 HOMOLOGOUS SUBDOMAINS (ALPHA, BETA, GAMMA).

CC -!- ENZYME REGULATION: THE ACTIVITY OF THIS ENZYME IS CONTROLLED
CC BY ADENYLATION. THE FULLY ADENYLATED ENZYME IS INACTIVE.

CC -!- ENZYME REGULATION: ACTIVATED BY GRAM-NEGATIVE BACTERIAL
CC LIPOPOLYSACCHARIDES AND CHYMOTRYPSIN.

CC -!- FUNCTION: PROFILIN PREVENTS THE POLYMERIZATION OF ACTIN.

CC -!- FUNCTION: INHIBITOR OF FUNGAL POLYGALACTURONASE. IT IS AN
CC IMPORTANT FACTOR FOR PLANT RESISTANCE TO PHYTOPATHOGENIC
CC FUNGI.

CC -!- INDUCTION: BY SALT STRESS AND BY ABSCISIC ACID (ABA).

CC -!- INDUCTION: BY INFECTION, PLANT WOUNDING, OR ELICITOR
CC TREATMENT OF CELL CULTURES.

CC -!- PATHWAY: FIRST STEP IN PROLINE BIOSYNTHESIS PATHWAY.

CC -!- PATHWAY: LAST STEP IN PROTOHEME BIOSYNTHESIS. IN ERYTHROID
CC CELLS, FERROCHELATASE APPEARS TO BE THE RATE-LIMITING ENZYME.

CC -!- POLYMORPHISM: THE ALLELIC FORM OF THE ENZYME WITH GLN-191
CC HYDROLYZES PARAOXON WITH A LOW TURNOVER NUMBER AND THE ONE
CC WITH ARG-191 WITH A HIGH TURNOVER NUMBER.

CC -!- POLYMORPHISM: THE TWO MAIN ALLELES OF HP ARE CALLED HP1F
CC (FAST) AND HP1S (SLOW). THE SEQUENCE SHOWN HERE IS THAT OF THE
CC HP1S FORM.

CC -!- PTM: O-GLYCOSYLATED; AN UNUSUAL FEATURE AMONG VIRAL
CC GLYCOPROTEINS.

CC -!- PTM: THE SOLUBLE FORM DERIVES FROM THE MEMBRANE FORM BY
CC PROTEOLYTIC PROCESSING.

CC -!- SIMILARITY: BELONGS TO THE SUBTILASE PROTEASES FAMILY. STRONG
CC SIMILARITY WITH OTHER FURIN-LIKE ENZYMES.

CC -!- SIMILARITY: BELONGS TO THE ATP-BINDING TRANSPORT PROTEIN
CC FAMILY (ABC TRANSPORTERS). BELONGS TO THE MDR SUBFAMILY.

CC -!- SUBCELLULAR LOCATION: MITOCHONDRIAL MATRIX.

CC -!- SUBCELLULAR LOCATION: INTEGRAL MEMBRANE PROTEIN. INNER
CC MEMBRANE.

CC -!- SUBUNIT: HOMOTETRAMER.

CC -!- SUBUNIT: HETERODIMER OF A LIGHT CHAIN AND A HEAVY CHAIN LINKED
CC BY A DISULFIDE BOND.

CC -!- TISSUE SPECIFICITY: KIDNEY, SUBMAXILLARY GLAND, AND URINE.

CC -!- TISSUE SPECIFICITY: SHOOTS, ROOTS, AND COTYLEDON FROM
CC DEHYDRATING SEEDLINGS.
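
The block and topic conventions described in this section can be sketched
in Python as follows. This is an illustration only: the names TOPICS and
parse_cc_lines are hypothetical, continuation lines are assumed to be
joined with a single space, and the last demonstration line is an
invented comment without a topic.

```python
from typing import List, Optional, Tuple

# The topics defined in the list above.
TOPICS = {
    "ALTERNATIVE PRODUCTS", "CATALYTIC ACTIVITY", "CAUTION", "COFACTOR",
    "DEVELOPMENTAL STAGE", "DISEASE", "DOMAIN", "ENZYME REGULATION",
    "FUNCTION", "INDUCTION", "PATHWAY", "POLYMORPHISM", "PTM",
    "SIMILARITY", "SUBCELLULAR LOCATION", "SUBUNIT", "TISSUE SPECIFICITY",
}


def parse_cc_lines(lines: List[str]) -> List[Tuple[Optional[str], str]]:
    """Group CC lines into (topic, text) blocks; topic is None if absent."""
    blocks: List[str] = []
    for raw in lines:
        if not raw.startswith("CC"):
            continue
        text = raw[2:].strip()
        if text.startswith("-!-"):
            blocks.append(text[3:].strip())   # `-!-' opens a new block
        elif blocks:
            blocks[-1] += " " + text           # continuation line
    result: List[Tuple[Optional[str], str]] = []
    for block in blocks:
        head, sep, rest = block.partition(":")
        if sep and head in TOPICS:
            result.append((head, rest.strip()))
        else:
            result.append((None, block))
    return result


comments = parse_cc_lines([
    "CC   -!- SUBUNIT: HETERODIMER OF A LIGHT CHAIN AND A HEAVY CHAIN LINKED",
    "CC       BY A DISULFIDE BOND.",
    "CC   -!- COFACTOR: PYRIDOXAL PHOSPHATE.",
    "CC   -!- THIS IS A FREE TEXT COMMENT.",   # hypothetical, no topic
])
```

Checking the text before the first colon against the list of defined
topics avoids mistaking a colon inside a free text comment for a topic
separator.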


[*] Wherever possible, we have described the catalytic activity of an
enzyme according to the recommendations of the Nomenclature
Committee of the International Union of Biochemistry and
Molecular Biology (IUBMB), as published in:

Enzyme Nomenclature, NC-IUBMB, Academic Press, New York, (1992).


3.16 The // line

The // (terminator) line contains no data or comments. It designates
the end of an entry.
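
Because every entry ends with a // line, a file containing several
entries can be split with a simple loop. The Python sketch below is an
illustration; the function name split_entries is hypothetical, and lines
found after the last terminator are assumed to belong to an incomplete
entry and are ignored.

```python
from typing import Iterable, List


def split_entries(lines: Iterable[str]) -> List[List[str]]:
    """Split a stream of lines into entries, using `//' as terminator."""
    entries: List[List[str]] = []
    current: List[str] = []
    for line in lines:
        if line.rstrip() == "//":
            entries.append(current)
            current = []
        else:
            current.append(line.rstrip("\n"))
    # Anything left in `current' lacks a terminator and is dropped.
    return entries
```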


APPENDIX A: FEATURE TABLE KEYS

The definition of each of the key names used in the feature table is
explained here. It is probable that new key names will progressively
be added to this list. For each key a number of examples are presented.



A.1 Change indicators

CONFLICT - Different papers report differing sequences.

Examples of CONFLICT key feature lines:

FT CONFLICT 33 33 MISSING (IN REF. 2).
FT CONFLICT 60 60 P -> A (IN REF. 3 AND 4).
FT CONFLICT 81 84 ASTQ -> GWT (IN REF. 3).


VARIANT - Authors report that sequence variants exist.

Examples of VARIANT key feature lines:

FT VARIANT 3 3 V -> I.
FT VARIANT 87 87 L -> T (IN STRAIN 2.3.1).
FT VARIANT 1 2 MISSING (IN 25% OF THE CHAINS).


VARSPLIC - Description of sequence variants produced by alternative
splicing.

Examples of VARSPLIC key feature lines:

FT VARSPLIC 194 196 GRP -> DVR (IN SHORT FORM).
FT VARSPLIC 197 211 MISSING (IN SHORT FORM).


MUTAGEN - Site which has been experimentally altered.

Examples of MUTAGEN key feature lines:

FT MUTAGEN 65 65 H->F: 100% LOSS OF ACTIVITY.
FT MUTAGEN 123 123 G->R,L,M: DNA BINDING LOST.
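
The descriptions of the change indicator keys shown above follow two
patterns: either the word MISSING, or an original string of residues
followed by an arrow and one or more replacements. The Python sketch
below recognises both, under the assumption that these are the only
patterns; the names Change and parse_change are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class Change(NamedTuple):
    missing: bool                    # True for `MISSING'
    original: str                    # residues in the listed sequence
    replacements: Tuple[str, ...]    # one or more alternatives


# Accepts `P -> A' as well as the compact MUTAGEN form `G->R,L,M'.
_CHANGE = re.compile(r"(MISSING)\b|([A-Z]+)\s*->\s*([A-Z]+(?:,[A-Z]+)*)")


def parse_change(description: str) -> Optional[Change]:
    """Return the change described at the start of a description field."""
    match = _CHANGE.match(description.strip())
    if match is None:
        return None
    if match.group(1):
        return Change(True, "", ())
    return Change(False, match.group(2), tuple(match.group(3).split(",")))
```

Any text after the change itself, such as `(IN REF. 3)' or the effect of
a mutation, is left untouched by this sketch.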


A.2 Amino acid modifications

MOD_RES - Post-translational modification of a residue.

The chemical nature of the modification is given in the description.


The general format of the MOD_RES description field is:

FT MOD_RES xxx xxx MODIFICATION (COMMENT).

The most frequently occurring modifications are the following:

ACETYLATION N-terminal or other.
AMIDATION Generally at the C-terminal of a mature
active peptide.
BLOCKED Undetermined N- or C-terminal blocking
group.
FORMYLATION Of the N-terminal methionine.
GAMMA-CARBOXYGLUTAMIC ACID
HYDROXYLATION Of asparagine, aspartic acid, proline or
lysine.
METHYLATION Generally of lysine or arginine.
PHOSPHORYLATION Of serine, threonine, tyrosine, aspartic
acid or histidine.
PYRROLIDONE CARBOXYLIC ACID N-terminal glutamate which has formed an
internal cyclic lactam.
SULFATATION Generally of tyrosine.

Examples of MOD_RES key feature lines:

FT MOD_RES 1 1 ACETYLATION.
FT MOD_RES 11 11 PHOSPHORYLATION (BY PKC).
FT MOD_RES 2 2 SULFATATION (BY SIMILARITY).
FT MOD_RES 8 8 AMIDATION (G-9 PROVIDES AMIDE GROUP).
FT MOD_RES 9 9 METHYLATION (MONO-, DI- & TRI-).
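
The `MODIFICATION (COMMENT).' layout of the description field can be
separated into its two parts as sketched below in Python. The function
name split_description is hypothetical, and the sketch assumes that the
comment, when present, is the final parenthesised part of the field.

```python
from typing import Optional, Tuple


def split_description(description: str) -> Tuple[str, Optional[str]]:
    """Split `MODIFICATION (COMMENT).' into (modification, comment)."""
    text = description.strip().rstrip(".")
    if text.endswith(")") and "(" in text:
        head, _, comment = text.partition("(")
        # Drop the closing parenthesis of the comment.
        return head.strip(), comment[:-1].strip()
    return text, None
```

For example, split_description("PHOSPHORYLATION (BY PKC).") separates
the modification PHOSPHORYLATION from the comment BY PKC.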


LIPID - Covalent binding of a lipid moiety.

The chemical nature of the bound lipid moiety is given in the
description. The general format of the LIPID description field is:

FT LIPID xxx xxx MODIFICATION (COMMENT).

The modifications which are currently defined are the following:

MYRISTATE Myristate group attached through an amide bond to
the N-terminal glycine residue of the mature form of
a protein [1,2] or to an internal lysine residue.

PALMITATE Palmitate group attached through a thioester bond to
a cysteine residue or through an ester bond to a
serine or threonine residue [1,2].

FARNESYL Farnesyl group attached through a thioether bond to
a cysteine residue [3,4].

GERANYL-GERANYL Geranyl-geranyl group attached through a thioether
bond to a cysteine residue [3,4].

GPI-ANCHOR Glycosyl-phosphatidylinositol (GPI) group linked to
the alpha-carboxyl group of the C-terminal residue
of the mature form of a protein [5,6].

N-ACYL DIGLYCERIDE N-terminal cysteine of the mature form of a
prokaryotic lipoprotein with an amide-linked fatty
acid and a glyceryl group to which two fatty acids
are linked by ester linkages [7].


Examples of LIPID key feature lines:

FT LIPID 1 1 MYRISTATE.
FT LIPID 65 65 PALMITATE (BY SIMILARITY).
FT LIPID 354 354 GPI-ANCHOR.


DISULFID - Disulfide bond.

The `FROM' and `TO' endpoints represent the two residues which are
linked by an intra-chain disulfide bond. If the `FROM' and `TO'
endpoints are identical, the disulfide bond is an interchain one and
the description field indicates the nature of the cross-link. Examples
of DISULFID key feature lines:

FT DISULFID 27 44 PROBABLE.
FT DISULFID 14 14 INTERCHAIN (WITH A LIGHT CHAIN).
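
The distinction between intra-chain and interchain bonds described above
depends only on whether the two endpoints are equal. A minimal Python
sketch, with the hypothetical names disulfide_kind and intrachain_pairs,
and with features assumed to be given as (key, from, to) tuples of
already parsed residue numbers:

```python
from typing import Iterable, List, Tuple


def disulfide_kind(start: int, end: int) -> str:
    """Identical endpoints denote an interchain bond."""
    return "interchain" if start == end else "intrachain"


def intrachain_pairs(
    features: Iterable[Tuple[str, int, int]]
) -> List[Tuple[int, int]]:
    """Residue pairs linked by intra-chain disulfide bonds."""
    return [
        (start, end)
        for key, start, end in features
        if key == "DISULFID" and start != end
    ]
```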


THIOLEST - Thiolester bond.

The `FROM' and `TO' endpoints represent the two residues which are
linked by the thiolester bond.

THIOETH - Thioether bond.

The `FROM' and `TO' endpoints represent the two residues which are
linked by the thioether bond.


CARBOHYD - Glycosylation site.

The nature of the carbohydrate (if known) is given in the description
field. Examples of CARBOHYD key feature lines:

FT CARBOHYD 103 103 GLUCOSYLGALACTOSE.
FT CARBOHYD 256 256 POTENTIAL.


METAL - Binding site for a metal ion.

The description field indicates the nature of the metal. Examples of
METAL key feature lines:

FT METAL 18 18 IRON (HEME AXIAL LIGAND).
FT METAL 87 87 COPPER (POTENTIAL).


BINDING - Binding site for any chemical group (co-enzyme, prosthetic
group, etc.).

The chemical nature of the group is given in the description field.
Examples of BINDING key feature lines:

FT BINDING 14 14 HEME (COVALENT).
FT BINDING 250 250 PYRIDOXAL PHOSPHATE.


A.3 Regions


SIGNAL - Extent of a signal sequence (prepeptide).

TRANSIT - Extent of a transit peptide (mitochondrial, chloroplastic, or
for a microbody).

Examples of TRANSIT key feature lines:

FT TRANSIT 1 42 CHLOROPLAST.
FT TRANSIT 1 25 MITOCHONDRION.
FT TRANSIT 1 23 MICROBODY (POTENTIAL).


PROPEP - Extent of a propeptide.

Examples of PROPEP key feature lines:

FT PROPEP 27 28 ACTIVATION PEPTIDE.
FT PROPEP 550 574 REMOVED IN MATURE FORM.


CHAIN - Extent of a polypeptide chain in the mature protein.

Examples of CHAIN key feature lines:

FT CHAIN 21 119 BETA-2 MICROGLOBULIN.
FT CHAIN 37 >42 FACTOR XIIIA.


PEPTIDE - Extent of a released active peptide.

Examples of PEPTIDE key feature lines:

FT PEPTIDE 13 107 NEUROPHYSIN 2.
FT PEPTIDE 235 239 MET-ENKEPHALIN.


DOMAIN - Extent of a domain of interest on the sequence.

The nature of that domain is given in the description field. Examples
of DOMAIN key feature lines:

FT DOMAIN 22 788 EXTRACELLULAR (POTENTIAL).
FT DOMAIN 140 152 ANCESTRAL CALCIUM SITE.

CA_BIND - Extent of a calcium-binding region.

DNA_BIND - Extent of a DNA-binding region.

NP_BIND - Extent of a nucleotide phosphate binding region.

The nature of the nucleotide phosphate is indicated in the description
field. Examples of NP_BIND key feature lines:

FT NP_BIND 13 25 ATP.
FT NP_BIND 45 49 GTP (POTENTIAL).
FT NP_BIND 8 34 FAD (ADP PART).

TRANSMEM - Extent of a transmembrane region.

ZN_FING - Extent of a zinc finger region.

Examples of ZN_FING key feature lines:

FT ZN_FING 110 134 GATA-TYPE.
FT ZN_FING 559 579 C4-TYPE.


SIMILAR - Extent of a similarity with another protein sequence.

Precise information relating to that sequence is given in the
description field. Examples of SIMILAR key feature lines:

FT SIMILAR 351 456 STRONG, WITH KAPPA CHAIN V REGIONS.
FT SIMILAR 580 1182 HIGH, WITH ERBB TRANSFORMING PROTEIN.


REPEAT - Extent of an internal sequence repetition.

Examples of REPEAT key feature lines:

FT REPEAT 75 300 APPROXIMATE.
FT REPEAT 390 600 APPROXIMATE.


A.4 Secondary structure


The feature table of sequence entries of proteins whose tertiary
structure is known experimentally contains the secondary structure
information corresponding to that protein. The secondary structure
assignment is made according to DSSP (see Kabsch W., Sander C.;
Biopolymers, 22:2577-2637(1983)) and the information is extracted from
the coordinate data sets of the Protein Data Bank (PDB).

In the feature table only three types of secondary structure are
specified: helices (key HELIX), beta-strands (key STRAND) and turns
(key TURN). Residues not specified in one of these classes are in a
`loop' or `random-coil' structure. Because the DSSP assignment has
more than the three common secondary structure classes, we have
converted the following DSSP assignments to HELIX, STRAND, and TURN:

DSSP code  DSSP definition                SWISS-PROT assignment
H          Alpha-helix                    HELIX
G          3(10) helix                    HELIX
I          Pi-helix                       HELIX
E          Hydrogen bonded beta-strand    STRAND
           (extended strand)
B          Residue in an isolated         STRAND
           beta-bridge
T          H-bonded turn                  TURN
           (3-turn, 4-turn or 5-turn)
S          Bend (five-residue             Not specified
           bend centered at residue i)


One should be aware of the following facts:

  1. Segment Length. For helices (alpha and 3-10), the residue just
    before and just after the helix as given by DSSP participates in the
    helical hydrogen bonding pattern with a single H-bond. For some
    practical purposes, one can therefore extend the HELIX range by one
    residue on each side. E.g. HELIX 25-35 instead of HELIX 26-34. Also,
    the ends of secondary structure segments are less well defined for
    lower resolution structures. A fluctuation of +/- one residue is
    common.

  2. Missing segments. In low resolution structures, badly formed helices
    or strands may be omitted in the DSSP definition.

  3. Special helices and strands. Helices of length three are 3-10
    helices, those of length four and longer are either alpha-helices or
    3-10 helices (pi helices are extremely rare). A strand of length one
    corresponds to a residue in an isolated beta-bridge. Such bridges
    can be structurally important.

  4. Missing secondary structure. No secondary structure is currently
    given in the feature table in the following cases:

    • No sequence data in the PDB entry.
    • Structure for which only C-alpha coordinates are in PDB.
    • NMR structure with more than one coordinate data set.
    • Model (i.e. theoretical) structure.


Examples:

FT HELIX 3 14
FT TURN 15 15
FT TURN 20 21
FT STRAND 23 23
FT HELIX 25 35


A.5 Others


ACT_SITE - Amino acid(s) involved in the activity of an enzyme.

Examples of ACT_SITE key feature lines:

FT ACT_SITE 193 193 ACCEPTS A PROTON DURING CATALYSIS.
FT ACT_SITE 99 99 CHARGE RELAY SYSTEM.

SITE - Any other interesting site on the sequence.

Examples of SITE key feature lines:

FT SITE 285 288 PREVENT SECRETION FROM ER.
FT SITE 241 242 CLEAVAGE (BY ANIMAL COLLAGENASES).

INIT_MET - The sequence is known to start with an initiator methionine.

This feature key is mostly associated with a zero value in the `FROM'
and `TO' fields.

FT INIT_MET 0 0

NON_TER - The residue at an extremity of the sequence is not the
terminal residue.

If applied to position 1, this signifies that the first position is not
the N-terminus of the complete molecule. If applied to the last
position, it signifies that this position is not the C-terminus of the
complete molecule. There is no description field for this key. Examples
of NON_TER key feature lines:

FT NON_TER 1 1
FT NON_TER 150 150

NON_CONS - Non-consecutive residues.

Indicates that two residues in a sequence are not consecutive and that
there are a number of unsequenced residues between them. Examples of
NON_CONS key feature lines:

FT NON_CONS 1036 1037
FT NON_CONS 33 34 N-TERMINAL / C-TERMINAL.
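
The NON_TER and NON_CONS keys together tell a program whether the listed
sequence is a fragment. The Python sketch below summarises them; the
function name describe_fragment and the layout of its result are
hypothetical, and features are assumed to be given as (key, from, to)
tuples of already parsed residue numbers.

```python
from typing import Dict, Iterable, Tuple


def describe_fragment(
    features: Iterable[Tuple[str, int, int]], seq_length: int
) -> Dict[str, object]:
    """Summarise missing termini and internal gaps of a listed sequence."""
    features = list(features)
    non_ter = {start for key, start, _ in features if key == "NON_TER"}
    # Position after which unsequenced residues occur.
    gaps = sorted(start for key, start, _ in features if key == "NON_CONS")
    return {
        "n_terminus_missing": 1 in non_ter,
        "c_terminus_missing": seq_length in non_ter,
        "gaps_after": gaps,
    }


status = describe_fragment(
    [("NON_TER", 1, 1), ("NON_TER", 150, 150), ("NON_CONS", 33, 34)],
    seq_length=150,
)
```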

UNSURE - Uncertainties in the sequence.

Used to describe region(s) of a sequence for which the authors are
unsure about the sequence assignment.