A new system for submission of nucleotide sequence data to the EMBL Data Library.

Plant Molecular Biology 11:541-550 © Kluwer Academic Publishers, Dordrecht - Printed in the Netherlands


Patricia Kahn, David Hazledine and Graham Cameron

EMBL Data Library European Molecular Biology Laboratory Postfach 10.2209 D-6900 Heidelberg Federal Republic of Germany


The editors of Plant Molecular Biology have decided to adopt a policy that manuscripts reporting or analysing primary nucleotide sequence data will be considered for publication only when accompanied by evidence that these data have been deposited with the nucleotide sequence database maintained by the Data Library at the European Molecular Biology Laboratory (EMBL). So that this policy can be implemented with a minimum of inconvenience and delay to authors, the EMBL Data Library has established procedures for rapidly processing data submitted by researchers and for providing them with evidence of deposition. Here we describe this system and what it implies for authors who submit manuscripts to Plant Molecular Biology. We also describe a new facility for making these data available to researchers via computer networks at or near the time of publication. This paper is based on descriptions of the new data submission scheme published previously in Nucleic Acids Research (EMBL and GenBank® staffs, 1987; Kahn and Hazledine, 1988), the first journal to adopt the scheme.

1 WHY A NEW SCHEME?

The rate at which nucleotide sequence data are being generated worldwide is increasing dramatically, a trend which will undoubtedly continue. There are presently three major nucleotide sequence databases worldwide (EMBL Data Library, Federal Republic of Germany; GenBank®, USA; and the DNA Data Bank of Japan) and these groups collaborate in entering and exchanging data. Despite this collaborative arrangement, the enormous volume of data is in danger of rendering current processing systems inadequate. Furthermore, it is becoming increasingly difficult for the data bank staffs to maintain the necessary biological expertise to abstract the relevant information from all published articles. The volume of sequence data is also problematic for scientific journals, which are becoming increasingly unable and/or unwilling to print these data.
At the same time, it is clear that access to a complete, up-to-date collection of primary sequence data is essential for research in virtually all areas of molecular biology, biotechnology and microbial and genetic diagnostics. It is therefore essential that all sequences which have been determined in laboratories around the world are rapidly made available to the research community. Since the databases currently collect much of their data by scanning journals for papers containing primary sequences, the trend against publication of these data must be accompanied by the establishment of alternative mechanisms for collecting and annotating (describing) them. Solutions to these problems will require, among other things, a radical revision of the mechanisms by which sequence data and related information enter the databases. The present data collection mechanisms need to be replaced by ones in which data capture is uncoupled from publication and researchers take a more active role in preparing their data for inclusion in the databases. The database staffs have devoted significant efforts to finding ways of encouraging researchers to submit data directly to the data banks. An important step in this direction was taken in the latter half of 1986 when a number of journals began distributing data submission forms to authors of manuscripts containing nucleotide sequence data. The form requested that authors send their sequence along with relevant annotation to EMBL or GenBank®, and gave instructions about how to do so. More recently, this system has been extended so that it encompasses amino acid sequence data and so that data is shared with the major protein sequence databases (Protein Identification Resource, USA; Martinsried Institute for Protein Sequences, Federal Republic of Germany; and the International Protein Information Database in Japan). A common data submission form is used by all the participating databases. 
The result has been a large increase in the number of researchers who submit data to the data banks, but this scheme is clearly insufficient in the long run. Compliance by authors is voluntary and therefore many do not respond; furthermore, only some of the submitted sequence data are in computer-readable form, the remainder being computer printouts. The latter must be typed into the data bank's computer, despite the fact that they clearly already reside in the researcher's computer. This additional work introduces time delays as well as the possibility of errors. However, this system was not conceived as an end in itself but as a first step in a broad revision of data capture mechanisms. The next step is to ensure that all primary sequence data relevant to a publication reach the database, regardless of whether a journal actually prints them. Requiring authors to submit their sequence data to the databases before the corresponding manuscripts will be considered for publication is an important step towards achieving this objective.

2 HOW THE NEW SYSTEM WORKS

2.1 Overview

As of 1 January 1989 Plant Molecular Biology will require that all primary sequence data reported or referred to in submitted manuscripts have a corresponding database accession number. An accession number, which permanently identifies a sequence (or set of sequences) in the database, constitutes evidence that the data have been submitted to and accepted by the EMBL Data Library. This means that authors who intend to submit manuscripts reporting sequences should first send the data to the Data Library. (The data will, as usual, be shared with the other major nucleotide and protein sequence databases but should be submitted to EMBL.) The Data Library staff will assign an accession number to data which appear to be complete and accurate, and will inform the authors promptly what this number is. When authors submit their manuscripts to Plant Molecular Biology they should inform the journal what accession number(s) their data have been assigned. The following is a set of instructions describing how researchers can submit their data to the EMBL Data Library and obtain an accession number as quickly as possible.

2.1 The first step in getting an accession number

Before doing anything else, authors should get a copy of a sequence data submission form. This form solicits all of the information needed to make a database entry; that is, the primary sequence data together with descriptive information such as the source of the sequenced segment (e.g., organism, strain, tissue) and the location of interesting regions within the sequence (e.g., coding regions, regulatory signals). It also contains information about data formats.
The data submission form exists in both a paper and a computer-readable version; the latter can be completed using a text editor. These versions are available from the following sources: (a) Paper form: printed at the end of this article and available upon request from EMBL and GenBank®. (b) Computer-readable form: (1) With all releases of the EMBL and GenBank® databases since the beginning of 1987.

(2) From EMBL by electronic mail (computer network) via our file server. Anyone with access to BITNET/EARN (either directly or via a gateway) can send a request to the EMBL file server, which will automatically return a copy of the data submission form by electronic mail. Instructions for using the EMBL file server are given in Appendix I.

(3) From EMBL, on Macintosh (3-1/2") or MS-DOS (5-1/4") floppy diskettes. Complete information on how to contact the EMBL Data Library is given in Appendix II.

(4) From GenBank® via electronic mail or on floppy diskette. Addresses and phone numbers for GenBank® are given in Appendix II. For information on requesting the form from GenBank® via Telenet, contact David Benton (+1-415-962-7360).

(5) Through the BIONET National Computer Resource for Molecular Biology (Mountain View, California, U.S.A.). Anyone who is already a subscriber to BIONET can get information about submitting sequences by typing HELP SUBMIT-SEQ at the system prompt. People who do not have accounts should contact Kathy Berg (+1-415-962-7337). BIONET provides the XGENPUB program, which assists users in filling out the form on-line and then automatically mails the completed form and corresponding sequence data to the EMBL, GenBank® and PIR (Protein Identification Resource, Washington, D.C., U.S.A.) databases.

2.2 What to submit to the EMBL Data Library

A data submission should include the following (for further details, see the data submission form itself):

(a) the sequence itself, in computer-readable form (computer network mail, magnetic tape or MS-DOS or Macintosh floppy diskette). Printouts will be accepted only if the authors have no access to a computer.

(b) a completed data submission form for each submitted sequence. The form is available from the sources listed in section 2.1.

(c) a computer network address, a telex number or a telefax number (advisable, to help speed things up, but not required).

2.3 How to send data to the EMBL Data Library

Data can be sent to the Data Library in one of several ways:

(a) Electronic file transfer: files can be sent via computer network to [email protected]. This BITNET/EARN address can be reached directly (by people at BITNET/EARN sites) or via various gateways from Arpanet, Usenet, JANET, etc. Ask your local network expert for help or phone us (+49-6221-387-258). Alternatively, data submitted through the XGENPUB program on BIONET are automatically forwarded by computer network to the EMBL Data Library, so the author does not have to worry about how to do this.

(b) Telefax to Data Submissions, EMBL Data Library. Our fax number is +49-6221-387-306.

(c) Normal post. See address given in Appendix II.

2.4 How long will it take to get an accession number?

We will process data submissions within 7 working days of receipt and send authors notification of either what accession number(s) they have been assigned or what additional information is needed. There are several things authors can do to minimise the time it takes to get an accession number:

(a) Be sure that submissions include all the necessary materials and that all relevant questions on the data submission form have been answered.

(b) Check the data to be sure that they do not contain inconsistencies or errors (e.g., a stop codon in the middle of a region listed on the form as an exon).

(c) Be sure to include either a computer network address or a telex or telefax number. If this information is not provided, notification of accession numbers will be sent by regular post. Telephoning is costly and time-consuming, and the Data Library will therefore not attempt to contact authors by phone.

Although we will process data submissions as quickly as we can, we strongly encourage authors to submit their data at or before the time they begin writing the manuscript, rather than once it is finished. This way we can process the data while the manuscript is being written, and authors will not have to delay submission of their manuscript while they wait for notification of their accession number. It should be emphasised that authors are responsible for communicating their accession number(s) to the journal at the time they submit their manuscript; the Data Library will not contact the journal.

2.5 Data security

The data submission form asks authors whether their submitted data can be made available to the public immediately or whether it should be withheld until publication.
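The consistency check suggested in section 2.4 (b) can be done before the data are sent. The following is a minimal sketch of one such check, not a description of any procedure used by the Data Library. It assumes the standard genetic code, 1-based inclusive coordinates, and a region that begins at the first base of a codon; the function name internal_stop_codons and the example sequence are hypothetical. Exons that do not start in frame would need their reading frame taken into account first.

```python
# Sketch only: assumes the standard genetic code and that the region
# starts at the first base of a codon.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def internal_stop_codons(sequence, start, end):
    """Return 1-based positions of stop codons that occur before the
    last codon of the region start..end (inclusive, 1-based)."""
    region = sequence[start - 1:end].upper().replace("U", "T")
    usable = len(region) - len(region) % 3  # ignore a trailing partial codon
    codons = [region[i:i + 3] for i in range(0, usable, 3)]
    positions = []
    # The final codon is skipped because a stop there may be legitimate.
    for index, codon in enumerate(codons[:-1]):
        if codon in STOP_CODONS:
            positions.append(start + 3 * index)
    return positions


if __name__ == "__main__":
    example = "ATGAAATAGCCCTAA"  # hypothetical sequence, not real data
    print(internal_stop_codons(example, 1, len(example)))  # prints [7]
```

An empty list means no internal stop codon was found under these assumptions; any reported position is worth comparing with the region boundaries entered on the form.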

APPENDIX I. EMBL NETWORK FILE SERVER

Computer users with access to BITNET/EARN (directly or via a gateway) can obtain copies of the data submission form, or of database entries, by sending commands to a file server running on the VAXcluster at EMBL. The file server facility is provided free of charge, though users may have to meet some or all of the communication costs, depending on the accounting system of their local computer service.

To use this facility, send file server commands (as electronic mail) to the address [email protected]. Each line of the mail message should consist of a single file server command, and nothing else. The mail can be sent over BITNET/EARN, or from any other network which has a gateway into BITNET/EARN (e.g., JANET in the UK or ARPANET in the USA).

The most important file server command, to get users started, is HELP. If the file server receives this command, it will return a help file to the sender, explaining in some detail how to use the facility.

In order to send electronic mail to a BITNET/EARN address, users must find out which command they have to use on their own local machine and how they should format the address [email protected]. Users who don't already know how to do this should contact their local computer service, or if all else fails, contact the Data Library and we will do our best to help.

Below are some examples which illustrate how to send commands to the file server using a VAX/VMS system that is a BITNET/EARN node running JNET software. To send a HELP command to the file server, you could use the operating system command MAIL as follows:

    $ MAIL "JNET%""NETSERV@EMBL ......

where is the name of a file containing file server commands. To request help information the file should contain the following command:

    HELP

To request a copy of the data submission form, it should contain the following GET command:

    GET DATALIB:DATASUB.TXT

Users can also request specific sequences via the File Server. Information on how to do this is provided in the HELP file.
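The rule that each line of the mail message carries exactly one file server command can be illustrated with a short sketch. It only assembles the message text and does not send mail, since the mail command and address format depend on the local system. The function name build_file_server_message is hypothetical, and the GET line is written with the file specification as a single token, which is an assumption about the intended spelling.

```python
def build_file_server_message(commands):
    """Return a mail body with one file server command per line.

    Sketch only: blank entries are dropped, and an entry that itself
    contains a line break is rejected, so that every line of the body
    holds a single command and nothing else.
    """
    lines = []
    for command in commands:
        command = command.strip()
        if not command:
            continue
        if "\n" in command or "\r" in command:
            raise ValueError("each command must fit on a single line")
        lines.append(command)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # HELP and GET are the two commands named in this appendix.
    body = build_file_server_message(["HELP", "GET DATALIB:DATASUB.TXT"])
    print(body, end="")
```

The resulting text can be saved to a file and mailed with whatever command the local system provides.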

APPENDIX II. HOW TO CONTACT THE NUCLEOTIDE SEQUENCE DATABASES

EMBL Data Library:
(a) Computer network: [email protected] (for data submissions); [email protected] (for questions requiring a personal response)
(b) Postal address: Data Submissions, EMBL Data Library, Postfach 10.2209, 6900 Heidelberg, Federal Republic of Germany
(c) Telephone: +49-6221-387-258
(d) Telefax: +49-6221-387-306
(e) Telex: 461613 (embl d)

GenBank®:
(a) Computer network address: [email protected]
(b) Postal address: GenBank® Submissions, Mail Stop K710, Los Alamos National Laboratory, Los Alamos, NM 87545, USA
(c) Telephone: +1-505-665-2177

REFERENCES

EMBL and GenBank® staffs (1987). A new system for direct submission of data to the nucleotide sequence databases. Nucl. Acids Res. 15 (18).

Kahn, P. and Hazledine, D. (1988). NAR's new requirement for data submission to the EMBL Data Library: information for authors. Nucl. Acids Res. (in press).
