In most nations, genetically engineered foods must be assessed for
their safety before market approval is granted. An important issue in
this safety assessment is the potential allergenicity of transgenic
("foreign") proteins that have been introduced into the food by genetic
engineering. In other words, what is the chance that the foreign
protein may cause allergic reactions after consumption of the
genetically engineered food containing this protein?
Potential allergenicity is assessed in a step-by-step procedure
described by the guidelines of the FAO/WHO Codex Alimentarius
Commission for the safety assessment of foods derived from genetically
engineered plants and micro-organisms. One
important step in this procedure is to determine, with the aid of
computer programs, whether the primary structure (amino acid sequence)
of the transgenic protein is similar to sequences of allergenic
proteins, which are available from public protein databases.
Two types of similarity are searched for:
- short identical stretches of 6-8 contiguous amino acids;
- larger stretches (80 amino acids long) containing a minimum
of 35% (non-contiguous) identical amino acids.
The similar stretches that are identified this way may harbour
potential binding sites (called epitopes) for IgE antibodies. IgE
antibodies are allergy-related and involved in the binding of the
allergen to mast cells, after which these cells release compounds, such
as histamine, that cause the symptoms of allergy. Allergens must
contain at least two IgE-binding epitopes to trigger a mast cell reaction.
To search for the two types of similarities, a recent Expert
Consultation of the FAO/WHO, which was held in preparation for the Codex
Alimentarius guidelines, devised the following procedure:
6.1. Sequence Homology as Derived from Allergen Databases
The commonly used protein databases (PIR, SwissProt and TrEMBL) contain
the amino acid sequences of most allergens for which this information
is known. However, these databases are currently not fully up-to-date.
A specialized allergen database is under construction.
Suggested procedure on how to determine the percent amino acid identity
between the expressed protein and known allergens.
Step 1: obtain the amino acid sequences of all allergens in the
protein databases (for SwissProt and TrEMBL: see http://expasy.ch/tools;
for PIR see http://wwwnbrf.georgetown.edu/pirwww) in FASTA format (using
the amino acids from the mature proteins only, disregarding the leader
sequences, if any). Let this be data set (1).
Step 2: prepare a complete set of 80-amino acid length sequences
derived from the expressed protein (again disregarding the leader
sequence, if any). Let this be data set (2).
Step 3: go to EMBL internet address: http://www2.ebi.ac.uk
and compare each of the sequences of the data set (2) with all
sequences of data set (1), using the FASTA program on the web site for
alignment with the default settings for gap penalty and width.
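Step 2 of the procedure above, preparing the complete set of 80-amino acid sequences from the expressed protein, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sliding_windows`, its parameters, and the example sequence are hypothetical and not part of the FAO/WHO text, and a step size of one residue is assumed.

```python
def sliding_windows(sequence, size=80, step=1):
    """Return every subsequence of `size` residues, moving `step` residues at a time.

    `sequence` is assumed to be the mature protein in one-letter code
    (leader sequence already removed). Sequences shorter than `size`
    yield an empty list.
    """
    last_start = len(sequence) - size
    return [sequence[i:i + size] for i in range(0, last_start + 1, step)]


# Made-up 100-residue sequence, used only to show the window count.
protein = "ACDEFGHIKLMNPQRSTVWY" * 5
windows = sliding_windows(protein)
print(len(windows))  # 21 windows: residues 1-80, 2-81, ..., 21-100
```

Each element of `windows` would then be submitted as a separate query against data set (1) in Step 3.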
Cross-reactivity between the expressed protein and a known allergen (as
can be found in the protein databases) has to be considered when there
is:
1) more than 35 % identity in the amino acid sequence of the expressed
protein (i.e. without the leader sequence, if any), using a window of
80 amino acids and a suitable gap penalty (using Clustal-type alignment
programs or equivalent alignment programs)
2) identity of 6 contiguous amino acids.
If any of the identity scores equals or exceeds 35 %, this is
considered to indicate significant homology within the context of this
assessment approach. The use of amino acid sequence homologies to
identify prospective cross-reacting allergens in genetically modified
foods has been discussed in more detail elsewhere (Gendel, 1998a;
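The 35 % criterion can be made concrete with a small calculation. The sketch below assumes that identity is expressed as the number of identical residues in an alignment divided by the 80-residue window length, and that a score equal to the threshold already counts as a hit ("equals or exceeds"). The names `window_identity` and `is_flagged` are hypothetical.

```python
WINDOW = 80        # window length in amino acids
THRESHOLD = 35.0   # percent identity regarded as significant


def window_identity(identical_residues, window=WINDOW):
    """Percent identity relative to the full window length (an assumption)."""
    return 100.0 * identical_residues / window


def is_flagged(identical_residues, threshold=THRESHOLD, window=WINDOW):
    """True when the identity equals or exceeds the threshold."""
    return window_identity(identical_residues, window) >= threshold


print(window_identity(28), is_flagged(28))  # 35.0 True
print(window_identity(27), is_flagged(27))  # 33.75 False
```

Under these assumptions, 28 identical residues out of 80 is the smallest count that reaches the threshold.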
The search facility on the Allermatch™ webtool automatically carries
out the procedure recommended by the guidelines on protein sequences
that are entered by the user in FASTA format (one-letter code
without residue numbers, see example sequence below). The user has
the option to select the following outputs of interest:
- Alignment of 80-amino-acid subsequences of the input sequence,
generated with a sliding window of 80 amino acids. The step size is 1
amino acid, so that a sequence of 100 amino acids, for example, yields 21
subsequences of 80 amino acids (1-80, 2-81, 3-82 ... 20-99, 21-100).
Each of these subsequences is aligned to database
sequences. The FASTA computer algorithm is used for these sequence
alignments, as recommended (see above; default values are used). With
FASTA, "head to tail" alignments (from the start to the end of a
sequence) are made of each subsequence with each database sequence. The
default threshold for the number of identical amino acids is 35% in the
alignment with an 80-amino-acid window, which is considered a
significant level of identity between the input sequence and the
allergenic protein's sequence (see recommendations cited above). The
identity reported by the website in the alignment results is
therefore the percentage of identical amino acids in the 80-amino-acid window. The
default threshold can be changed by the user. Input sequences shorter
than 80 amino acids should not be aligned using this option.
- Full alignment of the whole input sequence with database
sequences using the FASTA algorithm. This option can be used, for
example, for input sequences shorter than 80 amino acids, for which the
option of the 80-amino acids sliding window (see above) cannot be used.
Also, where an input sequence shows sufficient identity with
many proteins over its entire sequence, this option may provide a
good overview of the alignments between the input and database
sequences.
- Exact hits of short identical stretches of, for example, 6 amino
acids. To this end, a wordmatch algorithm is used, which searches for
identical matches of a specified number of contiguous amino acids
("wordlength") between the input sequence and a given database
sequence. The default value for the wordlength, which can be changed by
the user, is 6 amino acids. Decreasing the wordlength is likely to result in
a larger number of positive scores, while increasing it may yield fewer.
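The exact-match search described in the last option can be illustrated with a minimal sketch. It is not the webtool's own implementation: the function `word_matches`, its return format, and the two short sequences are hypothetical and serve only to show how the wordlength affects the number of hits.

```python
def word_matches(query, target, wordlength=6):
    """Find every stretch of `wordlength` contiguous residues shared exactly.

    Returns a list of (query_position, target_position, word) tuples,
    with 1-based start positions.
    """
    # Index every word of the target sequence by its start positions.
    index = {}
    for j in range(len(target) - wordlength + 1):
        index.setdefault(target[j:j + wordlength], []).append(j + 1)

    # Look up every word of the query sequence in that index.
    hits = []
    for i in range(len(query) - wordlength + 1):
        word = query[i:i + wordlength]
        for j in index.get(word, []):
            hits.append((i + 1, j, word))
    return hits


query = "MKTAYIAKQR"    # made-up input sequence
target = "GGTAYIAKLL"   # made-up database sequence

print(word_matches(query, target, wordlength=6))  # [(3, 3, 'TAYIAK')]
print(len(word_matches(query, target, wordlength=5)))  # 2
print(len(word_matches(query, target, wordlength=7)))  # 0
```

In this toy example, lowering the wordlength from 6 to 5 raises the number of hits from one to two, while raising it to 7 removes the hit altogether.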
The entered sequences will be compared to the sequences of allergenic
proteins compiled in the database. These sequences of allergenic
proteins have been extracted from protein databases.
Putative signal, pro- and transit peptides, whose positions are annotated
as "features" in the entries of the source protein databases, have been
removed from these sequences, which yields the
sequences of "mature" proteins.
Positive results of the analysis will be provided to the user.
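The removal of annotated signal, pro- and transit peptides mentioned above can be sketched as follows. The function `mature_sequence` and the representation of features as 1-based, inclusive `(start, end)` pairs are assumptions made for illustration. The actual feature format depends on the source database, and the sequence shown is made up.

```python
def mature_sequence(sequence, peptide_features):
    """Remove annotated regions from a precursor sequence.

    `peptide_features` is a list of (start, end) pairs, 1-based and
    inclusive, marking signal, pro- or transit peptides to be cut out.
    """
    keep = [True] * len(sequence)
    for start, end in peptide_features:
        for pos in range(max(start - 1, 0), min(end, len(sequence))):
            keep[pos] = False
    return "".join(res for res, kept in zip(sequence, keep) if kept)


precursor = "MKLLVVAAAQTPSDEF"   # made-up precursor, 16 residues
signal_peptide = [(1, 8)]        # hypothetical annotation: residues 1-8

print(mature_sequence(precursor, signal_peptide))  # AQTPSDEF
```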