2423 Symbols and Format To Be Used for Nucleotide and/or Amino Acid Sequence Data for WIPO ST.25 [R-07.2022]
[Editor Note: This section is not applicable to applications filed on or after July 1, 2022, having disclosures of nucleotide and/or amino acid sequences as defined in 37 CFR 1.831(b). See MPEP §§ 2412-2419 for guidance on WIPO ST.26 requirements for applications filed on or after July 1, 2022.]
37 CFR 1.822 Symbols and format to be used for nucleotide and/or amino acid sequence data.
- (a) The symbols and format to be used for nucleotide and/or amino acid sequence data shall conform to the requirements of paragraphs (b) through (e) of this section.
- (b) The code for representing the nucleotide and/or amino acid sequence characters shall conform to the code set forth in appendices A and C to this subpart. No code other than that specified in these sections shall be used in nucleotide and amino acid sequences. A modified base or modified or unusual amino acid may be presented in a given sequence as the corresponding unmodified base or amino acid if the modified base or modified or unusual amino acid is one of those listed in appendices B and D to this subpart, and the modification is also set forth in the Feature section. Otherwise, each occurrence of a base or amino acid not appearing in appendices A and C, shall be listed in a given sequence as “n” or “Xaa,” respectively, with further information, as appropriate, given in the Feature section, by including one or more feature keys listed in appendices E and F to this subpart.
- Note 1 to paragraph (b): Appendices A through F to this subpart contain Tables 1–6 of the World Intellectual Property Organization (WIPO) Handbook on Industrial Property Information and Documentation, Standard ST.25: Standard for the Presentation of Nucleotide and Amino Acid Sequence Listings in Patent Applications (2009).
- (c) Format representation of nucleotides.
- (1) A nucleotide sequence shall be listed using the lowercase letter for representing the one-letter code for the nucleotide bases set forth in appendix A to this subpart.
- (2) The bases in a nucleotide sequence (including introns) shall be listed in groups of 10 bases except in the coding parts of the sequence. Leftover bases, fewer than 10 in number, at the end of noncoding parts of a sequence shall be grouped together and separated from adjacent groups of 10 or 3 bases by a space.
- (3) The bases in the coding parts of a nucleotide sequence shall be listed as triplets (codons). The amino acids corresponding to the codons in the coding parts of a nucleotide sequence shall be listed immediately below the corresponding codons. Where a codon spans an intron, the amino acid symbol shall be listed below the portion of the codon containing two nucleotides.
- (4) A nucleotide sequence shall be listed with a maximum of 16 codons or 60 bases per line, with a space provided between each codon or group of 10 bases.
- (5) A nucleotide sequence shall be represented, only by a single strand, in the 5′ to 3′ direction, from left to right.
- (6) The enumeration of nucleotide bases shall start at the first base of the sequence with number 1. The enumeration shall be continuous through the whole sequence in the direction 5′ to 3′. The enumeration shall appear in the right margin, next to the line containing the one-letter codes for the bases and giving the number of the last base of that line.
- (7) For those nucleotide sequences that are circular in configuration, the enumeration method set forth in paragraph (c)(6) of this section remains applicable with the exception that the designation of the first base of the nucleotide sequence may be made at the option of the applicant.
- Note 2 to paragraph (c): Appendices A through F to this subpart contain Tables 1–6 of the World Intellectual Property Organization (WIPO) Handbook on Industrial Property Information and Documentation, Standard ST.25: Standard for the Presentation of Nucleotide and Amino Acid Sequence Listings in Patent Applications (2009).
- (d) Representation of amino acids.
- (1) The amino acids in a protein or peptide sequence shall be listed using the three-letter abbreviation, with the first letter as an upper case character, as in Appendix C to this subpart.
- (2) A protein or peptide sequence shall be listed with a maximum of 16 amino acids per line, with a space provided between each amino acid.
- (3) An amino acid sequence shall be represented in the amino to carboxy direction, from left to right, and the amino and carboxy groups shall not be represented in the sequence.
- (4) The enumeration of amino acids may start at the first amino acid of the first mature protein, with the number 1. When represented, the amino acids preceding the mature protein, (e.g., pre‑sequences, pro-sequences, pre‑pro-sequences, and signal sequences) shall have negative numbers, counting backwards starting with the amino acid next to number 1. Otherwise, the enumeration of amino acids shall start at the first amino acid at the amino terminal as number 1, and shall appear below every five amino acids of the sequence. The enumeration method for amino acid sequences that is set forth in this section remains applicable for amino acid sequences that are circular in configuration, with the exception that the designation of the first amino acid of the sequence may be made at the option of the applicant.
- (5) An amino acid sequence that contains internal terminator symbols (e.g., “Ter,” “*,” or “.,” etc.) may not be represented as a single amino acid sequence but shall be represented as separate amino acid sequences.
- Note 3 to paragraph (d): Appendices A through F to this subpart contain Tables 1–6 of the World Intellectual Property Organization (WIPO) Handbook on Industrial Property Information and Documentation, Standard ST.25: Standard for the Presentation of Nucleotide and Amino Acid Sequence Listings in Patent Applications (2009).
- (e) A sequence with a gap or gaps shall be represented as a plurality of separate sequences, with separate sequence identifiers (§ 1.823(a)(5)), with the number of separate sequences being equal in number to the number of continuous strings of sequence data. A sequence composed of one or more noncontiguous segments of a larger sequence or segments from different sequences shall be presented as a separate sequence.
Appendices A through F referenced in 37 CFR 1.822 are reproduced in MPEP § 2422(I).
2423.01 Format and Symbols To Be Used in a “Sequence Listing” [R-07.2022]
[Editor Note: This section is not applicable to applications filed on or after July 1, 2022, having disclosures of nucleotide and/or amino acid sequences as defined in 37 CFR 1.831(b). See MPEP §§ 2412-2419 for guidance on WIPO ST.26 requirements for applications filed on or after July 1, 2022.]
37 CFR 1.822 sets forth the format and symbols to be used for listing nucleotide and/or amino acid sequence data. The symbols for representing the nucleotide and/or amino acid characters in the sequences are set forth in Appendices A and C to Subpart G of Part 1 of the CFR. See MPEP § 2422(I). No other symbols shall be used in nucleotide and amino acid sequences. The “modified base” and “modified and unusual amino acid” symbols appearing in Appendices B and D to Subpart G of Part 1 of the CFR (see 37 CFR 1.822 and MPEP § 2422(I)) are not to be set forth in the sequences recited in the “Sequence Listing”. However, “modified base” or “modified and unusual amino acid” symbols may be used in the written description and/or drawing portions of the specification. To properly enter notations for modified bases or amino acids in the “Sequence Listing”, the Feature section of the “Sequence Listing” should be used. That is, a modified base or amino acid may be presented in a given sequence as the corresponding unmodified base or amino acid if the modified base or amino acid is one of those listed in Appendices B and D to Subpart G of Part 1 of the CFR and the modification is also set forth in the Feature section of the “Sequence Listing”. Otherwise, all nucleotide bases or amino acids not appearing in Appendices A and C to Subpart G of Part 1 of the CFR must be listed in a given sequence as “n” or “Xaa,” respectively, with further information given in the Feature section of the “Sequence Listing” by including one or more feature keys listed in Appendices E and F to Subpart G of Part 1 of the CFR. See 37 CFR 1.822(b).
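The substitution rule for nucleotide symbols can be illustrated with a minimal Python sketch. The symbol set below is an assumption transcribed from WIPO ST.25 Table 1 and should be verified against Appendix A as reproduced in MPEP § 2422(I); the function name `normalize_bases` is hypothetical. The sketch only flags positions that would need further information in the Feature section and does not attempt to handle the modified bases of Appendix B.

```python
# Assumed contents of Appendix A (WIPO ST.25 Table 1); verify against MPEP § 2422(I).
APPENDIX_A_BASES = set("acgturymkswbdhvn")


def normalize_bases(raw):
    """Return (sequence, flagged_positions).

    Each symbol is lowercased (37 CFR 1.822(c)(1)). Any symbol not in the
    assumed Appendix A set is replaced by "n", and its 1-based position is
    recorded so that further information can be given in the Feature section.
    """
    out = []
    flagged = []
    for position, ch in enumerate(raw, start=1):
        symbol = ch.lower()
        if symbol in APPENDIX_A_BASES:
            out.append(symbol)
        else:
            out.append("n")
            flagged.append(position)
    return "".join(out), flagged
```

For example, under these assumptions `normalize_bases("ACGTxz")` returns the sequence `acgtnn` together with positions 5 and 6 as candidates for Feature annotations.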
37 CFR 1.822(b) and (d) require the use of three-letter symbols for amino acids in the “Sequence Listing”. The three-letter symbols must be presented using the upper case for the first character and lower case for the remaining two characters. Applicants are encouraged to use the three-letter symbols for amino acids throughout the disclosure, instead of the one-letter symbols, for easier reading of the application and any patent issuing therefrom.
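Where a disclosure uses one-letter amino acid symbols, the conversion to the three-letter form might be sketched as follows. The mapping is an assumption based on the conventional one-letter and three-letter codes and should be checked against Appendix C in MPEP § 2422(I); the function name `to_three_letter` is hypothetical. Any symbol not in the mapping is emitted as “Xaa”, which would then require further information in the Feature section.

```python
# Assumed one-letter to three-letter mapping; verify against Appendix C (MPEP § 2422(I)).
ONE_TO_THREE = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
    "B": "Asx", "Z": "Glx", "X": "Xaa",
}


def to_three_letter(one_letter_seq):
    """Convert one-letter symbols to a list of three-letter symbols.

    Symbols missing from the assumed mapping become "Xaa".
    """
    return [ONE_TO_THREE.get(ch.upper(), "Xaa") for ch in one_letter_seq]
```

Each returned symbol has an upper case first character and lower case remaining characters, as the rule requires.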
37 CFR 1.822(c) through (e) set forth the format for presenting sequence data. These paragraphs set forth the manner in which the characters in sequences are to be grouped, spaced, presented and numbered.
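As an illustration of the grouping and numbering requirements of 37 CFR 1.822(c)(2), (c)(4), and (c)(6) for a noncoding nucleotide sequence, the following Python sketch groups bases in tens, limits each line to 60 bases, and places the number of the last base of each line in the right margin. The function name `format_noncoding` and the column widths are assumptions chosen for illustration; the rule does not prescribe them.

```python
def format_noncoding(seq, group=10, per_line=60):
    """Format a noncoding nucleotide sequence as a list of text lines.

    Bases are listed in groups of `group` separated by a space, with at most
    `per_line` bases per line. The number of the last base on each line is
    right-aligned in an assumed margin column.
    """
    lines = []
    for start in range(0, len(seq), per_line):
        chunk = seq[start:start + per_line]
        groups = [chunk[i:i + group] for i in range(0, len(chunk), group)]
        last_base = start + len(chunk)  # 1-based number of the last base on this line
        # 65 = 60 bases plus 5 separating spaces (assumed layout width)
        lines.append(f"{' '.join(groups):<65} {last_base:>8}")
    return lines
```

For a 65-base sequence, this sketch yields one full line of six groups numbered 60 and a second line holding the five leftover bases numbered 65.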
2423.02 Depiction of Coding Regions [R-07.2022]
[Editor Note: This section is not applicable to applications filed on or after July 1, 2022, having disclosures of nucleotide and/or amino acid sequences as defined in 37 CFR 1.831(b). See MPEP §§ 2412-2419 for guidance on WIPO ST.26 requirements for applications filed on or after July 1, 2022.]
If applicant chooses to depict coding regions, 37 CFR 1.822(c)(3) requires the amino acids corresponding to the codons in the coding parts of a nucleotide sequence to be listed immediately below the corresponding codons. 37 CFR 1.822(c)(3) also addresses the situation in which a codon spans an intron. In that situation, the “amino acid symbol shall be listed below the portion of the codon containing two nucleotides.”
It should be noted that the sequence rules do not, in any way, require the depiction of coding regions or the amino acids corresponding to the codons in those coding regions. 37 CFR 1.822(c)(3) only requires that where amino acids corresponding to the codons in the coding parts of a nucleotide sequence are depicted, they must be depicted below the corresponding codons. There is absolutely no requirement in the rules to depict coding regions. However, when the coding parts of a nucleotide sequence and their corresponding amino acids have been enumerated by their residues, those amino acids must also be set forth as a separate sequence if the amino acid sequence meets the length thresholds in 37 CFR 1.821(a).
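Where coding regions are depicted, the layout described in 37 CFR 1.822(c)(3) and (c)(4) might be sketched as below: codons are listed as triplets, at most 16 per line, with the corresponding three-letter amino acid symbol on the line immediately beneath each codon. The function name `format_coding` is hypothetical, and the sketch assumes the caller supplies the codons and their amino acids already paired. It omits numbering and does not handle a codon that spans an intron.

```python
def format_coding(codons, amino_acids, per_line=16):
    """Return alternating codon and amino acid lines for a coding region.

    `codons` is a list of lowercase triplets and `amino_acids` the matching
    list of three-letter symbols. Both are three characters wide, so joining
    each list with single spaces keeps every symbol under its codon.
    """
    if len(codons) != len(amino_acids):
        raise ValueError("each codon needs exactly one amino acid symbol")
    rows = []
    for start in range(0, len(codons), per_line):
        rows.append(" ".join(codons[start:start + per_line]))
        rows.append(" ".join(amino_acids[start:start + per_line]))
    return rows
```

Calling `format_coding(["atg", "aaa", "acc"], ["Met", "Lys", "Thr"])` produces the codon line `atg aaa acc` followed by the amino acid line `Met Lys Thr`.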
2423.03 Presentation and Numbering of Sequences [R-07.2022]
[Editor Note: This section is not applicable to applications filed on or after July 1, 2022, having disclosures of nucleotide and/or amino acid sequences as defined in 37 CFR 1.831(b). See MPEP §§ 2412-2419 for guidance on WIPO ST.26 requirements for applications filed on or after July 1, 2022.]
37 CFR 1.822(c)(5) provides that nucleotide sequences shall only be represented by a single strand, in the 5′ to 3′ direction, from left to right. That is, double stranded nucleotides shall not be represented in the “Sequence Listing”. A double stranded nucleotide may be represented as two single stranded nucleotides, and any relationship between the two may be shown in the drawings.
The procedures for presenting and numbering amino acid sequences are set forth in 37 CFR 1.822(d). Two alternatives are presented for numbering amino acid sequences. Amino acid sequences may be numbered with respect to the identification of the first amino acid of the first mature protein or with respect to the first amino acid appearing at the amino terminal. The numbering procedure for nucleotides is set forth in 37 CFR 1.822(c)(6). Sequences that are circular in configuration are intended to be encompassed by these rules, and the numbering procedures described above remain applicable with the exception that the designation of the first nucleotide base or amino acid of the sequence may be made at the option of the applicant. See 37 CFR 1.822(c)(7) and (d)(4).
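The two numbering alternatives for amino acid sequences can be illustrated with a short Python sketch. The names `residue_number` and `format_protein` are hypothetical. The sketch assumes that a number is printed beneath residue 1 and beneath every residue whose number is a multiple of five, which is one reading of the requirement that the enumeration appear below every five amino acids; the four-character cell width is likewise an assumption.

```python
def residue_number(position, mature_start=1):
    """Number for the residue at 1-based `position`.

    `mature_start` is the 1-based position of the first amino acid of the
    first mature protein. Residues before it receive negative numbers,
    counting backwards from -1; there is no residue numbered zero.
    """
    n = position - mature_start + 1
    return n if n >= 1 else n - 1


def format_protein(residues, mature_start=1, per_line=16):
    """Return alternating residue and numbering lines (16 residues per line)."""
    lines = []
    for start in range(0, len(residues), per_line):
        chunk = residues[start:start + per_line]
        labels = []
        for offset in range(len(chunk)):
            n = residue_number(start + offset + 1, mature_start)
            # Assumed convention: label residue 1 and every multiple of five.
            labels.append(str(n) if n == 1 or n % 5 == 0 else "")
        lines.append(" ".join(chunk))
        lines.append("".join(f"{label:<4}" for label in labels).rstrip())
    return lines
```

With `mature_start=1`, numbering begins at the amino terminal. With `mature_start=3`, the two residues preceding the mature protein are numbered -2 and -1, and the third residue is number 1.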
37 CFR 1.822(e) sets forth the procedures for presenting and numbering hybrid and gapped sequences. A sequence with a gap or gaps shall be presented as a plurality of separate sequences, each having separate sequence identifiers, with the number of separate sequences being equal in number to the number of continuous strings of sequence data. The term “gap” is not intended to embrace a gap or gaps that is/are introduced into the presentation of otherwise continuous sequence information in, e.g., a drawing figure, to show alignments or similarities with other sequences. The “gaps” referred to in this section are gaps representing unknown or undisclosed regions in a sequence between regions that are known or disclosed. On the other hand, a sequence that contains one or more regions of contiguous “n” or “Xaa” residues, wherein the exact number of “n” or “Xaa” residues in each region is disclosed, must be included in the “Sequence Listing” as a single sequence with a single sequence identifier. A sequence disclosed by enumeration of its residues that is constructed as a single continuous sequence from one or more non-contiguous segments of a larger sequence or segments from different sequences must be included in the “Sequence Listing” as a single sequence with a single sequence identifier. A fragment of a larger sequence need not be enumerated by its residues, and may be referred to in the specification, claims or drawings as, e.g., “residues 2 through 33 of SEQ ID NO:12,” assuming that SEQ ID NO:12 has been properly included in the “Sequence Listing”.
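The treatment of gapped sequences can be sketched in Python as follows. The sketch assumes, purely for illustration, that an unknown or undisclosed region is marked in the input by one or more “-” characters; neither that marker nor the function name `split_gapped` comes from the rules. Each continuous string of sequence data receives its own sequence identifier, while a run of “n” residues of disclosed length is left inside its sequence.

```python
import re


def split_gapped(seq, first_id=1, gap_marker="-"):
    """Split a sequence at gap markers and assign sequence identifiers.

    Returns a dict mapping consecutive identifiers, starting at `first_id`,
    to each continuous string of sequence data. Runs of "n" are not gaps
    and are kept within their sequence.
    """
    pattern = re.escape(gap_marker) + "+"
    parts = [part for part in re.split(pattern, seq) if part]
    return {first_id + i: part for i, part in enumerate(parts)}
```

Under these assumptions, `split_gapped("acgtnnnnacgt--ggcc", first_id=7)` yields two sequences: identifier 7 for `acgtnnnnacgt` and identifier 8 for `ggcc`.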