Introduction

In most nations, genetically engineered foods must be assessed for their safety before market approval is granted. An important issue in this safety assessment is the potential allergenicity of transgenic ("foreign") proteins that have been introduced into the food by genetic engineering. In other words, what is the chance that the foreign protein may cause allergic reactions after consumption of the genetically engineered food containing this protein?

Potential allergenicity is assessed in a step-by-step procedure described by the guidelines of the FAO/WHO Codex Alimentarius Commission for the safety assessment of foods derived from genetically engineered plants and micro-organisms [1]. One important step in this procedure is to determine, with the aid of computer programs, whether the primary structure (amino acid sequence) of the transgenic protein is similar to the sequences of known allergenic proteins, which are available from public protein sequence databases.

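Two types of similarity are searched for:

  1. Overall similarity: more than 35% identity in the amino acid sequence within a window of 80 amino acids.
  2. Short identical stretches: identity of a small number of contiguous amino acids (for example, 6).
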
The similar stretches that are identified this way may harbour potential binding sites (called epitopes) for IgE antibodies. IgE antibodies are allergy-related and are involved in the binding of the allergen to mast cells, after which these cells release compounds, such as histamine, that cause the symptoms of allergy. An allergen must contain at least two IgE-binding epitopes to trigger a mast cell reaction.

To search for the two types of similarity, a recent Expert Consultation of the FAO/WHO, which was held in preparation for the Codex Alimentarius guidelines, devised the following procedure [2]:

6.1. Sequence Homology as Derived from Allergen Databases

The commonly used protein databases (PIR, SwissProt and TrEMBL) contain the amino acid sequences of most allergens for which this information is known. However, these databases are currently not fully up-to-date. A specialized allergen database is under construction.

Suggested procedure on how to determine the percent amino acid identity between the expressed protein and known allergens.

Step 1: obtain the amino acids sequences of all allergens in the protein databases (for SwissProt and TrEMBL: see http://expasy.ch/tools; for PIR see http://wwwnbrf.georgetown.edu/pirwww ) in FASTA-format (using the amino acids from the mature proteins only, disregarding the leader sequences, if any). Let this be data set (1).

Step 2: prepare a complete set of 80-amino acid length sequences derived from the expressed protein (again disregarding the leader sequence, if any). Let this be data set (2).

Step 3: go to EMBL internet address: http://www2.ebi.ac.uk and compare each of the sequences of the data set (2) with all sequences of data set (1), using the FASTA program on the web site for alignment with the default settings for gap penalty and width.

Cross-reactivity between the expressed protein and a known allergen (as can be found in the protein databases) has to be considered when there is:

1) more than 35 % identity in the amino acid sequence of the expressed protein (i.e. without the leader sequence, if any), using a window of 80 amino acids and a suitable gap penalty (using Clustal-type alignment programs or equivalent alignment programs)

or:

2) identity of 6 contiguous amino acids.

If any of the identity scores equals or exceeds 35 %, this is considered to indicate significant homology within the context of this assessment approach. The use of amino acid sequence homologies to identify prospective cross-reacting allergens in genetically modified foods has been discussed in more detail elsewhere (Gendel, 1998a; Gendel, 1998b).
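Step 2 of the procedure quoted above, together with the 35% criterion, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used by any particular tool: the function names are hypothetical, the alignment itself (FASTA) is not reproduced, and the number of identical residues is assumed to come from an external alignment. The sketch follows the "equals or exceeds 35 %" wording, although the quoted text also says "more than 35 %".

```python
def sliding_windows(sequence, window=80, step=1):
    """Return (start, end, subsequence) tuples using 1-based, inclusive positions.

    Sequences shorter than the window yield an empty list.
    """
    if len(sequence) < window:
        return []
    return [
        (i + 1, i + window, sequence[i:i + window])
        for i in range(0, len(sequence) - window + 1, step)
    ]


def percent_identity(identical_residues, window=80):
    """Percentage of identical residues relative to the window length."""
    return 100.0 * identical_residues / window


def meets_threshold(identical_residues, window=80, threshold=35.0):
    """Assumption: 'equals or exceeds' the threshold counts as a positive result."""
    return percent_identity(identical_residues, window) >= threshold
```

With an 80-residue window, 28 identical residues correspond to exactly 35% identity, so under this assumption 28 or more identical residues in a window would be flagged.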

The search facility on the Allermatch™ webtool automatically carries out the procedure recommended by the guidelines on protein sequences that are entered by the user in FASTA format (one-letter code without residue numbers, see example sequence below). The user has the option to select the following outputs of interest:

  1. Alignment of 80-amino-acid subsequences of the input sequence, generated with a sliding window of 80 amino acids. The step size is 1 amino acid, so that a sequence of 100 amino acids, for example, yields 21 subsequences of 80 amino acids each (1-80, 2-81, 3-82 ... 20-99, 21-100). Each of these subsequences is aligned to the database sequences. The FASTA algorithm is used for these alignments, as recommended (see above; default values are used). With FASTA, "head to tail" alignments (from the start to the end of a sequence) are made of each subsequence with each database sequence. The default threshold is 35% identical amino acids in the alignment within the 80-amino-acid window, which is considered a significant level of identity between the input sequence and the allergenic protein's sequence (see recommendations cited above). The identity reported by the website in the alignment results is therefore the percentage of identical amino acids in the 80-amino-acid window. The default threshold can be changed by the user. Input sequences shorter than 80 amino acids should not be aligned using this option.
  2. Full alignment of the whole input sequence with database sequences using the FASTA algorithm. This option can be used, for example, for input sequences shorter than 80 amino acids, for which the 80-amino-acid sliding window option (see above) cannot be used. Also, in cases where an input sequence shows sufficient identity with many proteins over its entire length, this option may provide a good overview of the alignments between the input and database sequences.
  3. Exact hits of short identical stretches of, for example, 6 amino acids. To this end, a wordmatch algorithm is used, which searches for identical matches of a specified number of contiguous amino acids ("wordlength") between the input sequence and a given database sequence. The default value for the wordlength, which can be changed by the user, is 6 amino acids. Decreasing the wordlength is likely to result in a larger number of positive scores, while increasing it may yield fewer positive results.
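The exact-match search in the third option can be illustrated with a small sketch. It is a minimal example of a wordmatch-style search, assuming a simple dictionary index over one database sequence; the function name and the example sequences are hypothetical and are not taken from the webtool.

```python
def word_matches(query, target, wordlength=6):
    """Find identical stretches of `wordlength` contiguous residues.

    Returns a list of (query_position, target_position, word) tuples,
    with 1-based start positions.
    """
    # Index every word of the target sequence by its start position(s).
    index = {}
    for j in range(len(target) - wordlength + 1):
        index.setdefault(target[j:j + wordlength], []).append(j + 1)

    # Look up every word of the query sequence in that index.
    hits = []
    for i in range(len(query) - wordlength + 1):
        word = query[i:i + wordlength]
        for j in index.get(word, []):
            hits.append((i + 1, j, word))
    return hits


# Hypothetical example sequences, for illustration only.
query = "MKTAYIAKQRQ"
target = "GGTAYIAKLL"
print(word_matches(query, target, wordlength=6))
```

For these two made-up sequences, a wordlength of 6 gives a single hit, a wordlength of 5 gives two overlapping hits, and a wordlength of 7 gives none, which mirrors the effect of changing the wordlength described above.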

The entered sequences will be compared to the sequences of allergenic proteins compiled in the database. These sequences of allergenic proteins have been extracted from protein databases. Putative signal-, pro-, and transit-peptides, whose positions are indicated by the protein source database accession as "features", have been removed from these sequences, which yields the sequences of "mature" proteins.
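As a rough illustration of how "mature" sequences can be derived from annotated positions, the following sketch removes the residues covered by leader features. The feature representation (a list of `(kind, start, end)` tuples with 1-based, inclusive positions) and the labels `SIGNAL`, `PROPEP` and `TRANSIT` are assumptions made for this example; the actual extraction procedure used for the database is not described here.

```python
# Assumed labels for signal-, pro- and transit-peptide features.
LEADER_FEATURES = {"SIGNAL", "PROPEP", "TRANSIT"}


def mature_sequence(sequence, features):
    """Remove residues covered by leader features.

    `features` is a list of (kind, start, end) tuples with 1-based,
    inclusive positions (a hypothetical representation).
    """
    remove = set()
    for kind, start, end in features:
        if kind in LEADER_FEATURES:
            # Convert 1-based inclusive positions to 0-based indices.
            remove.update(range(start - 1, end))
    return "".join(res for i, res in enumerate(sequence) if i not in remove)


# Hypothetical 16-residue precursor with a 5-residue signal peptide.
precursor = "MKLLVAAAGHTRESTK"
print(mature_sequence(precursor, [("SIGNAL", 1, 5), ("CHAIN", 6, 16)]))
```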

Positive results of the analysis will be provided to the user.