A standard file format for data from DNA sequencing instruments.

© 1992 Harwood Academic Publishers GmbH

DNA Sequence - J. DNA Sequencing and Mapping, Vol. 3, pp. 107-110
Reprints available directly from the publisher
Photocopying permitted by license only

Printed in the United Kingdom

SHORT COMMUNICATION

A standard file format for data from DNA sequencing instruments

Mitochondrial DNA. Downloaded from informahealthcare.com by University of Newcastle on 01/09/15. For personal use only.

SIMON DEAR and RODGER STADEN
MRC Laboratory of Molecular Biology, Hills Road, Cambridge CB2 2QH, UK

There are now a number of machines for determining DNA sequences. These devices are currently of two types: those such as the Applied Biosystems 373A and the Pharmacia A.L.F. which interpret the sequences of samples as they run on gels within the machine, and those, such as the Bio-Rad and Amersham readers that scan and analyse conventional autoradiographs. Both types of machine can produce their data in the form of traces which represent the band intensity of each of the four base types at each position in the sequence. At present all the machines write files in different formats. We describe a machine independent format for storing data derived from automatic sequencing machines. Files in this format can store the derived sequence, the traces and a set of confidence measures for each base. We have adopted the format as the standard for our sequence handling software.

KEY WORDS: DNA sequencing instruments, file format

INTRODUCTION One of the important components in efforts to increase the rate at which data are gathered for genome projects is the use of machines to derive the sequence. A valuable, and natural, outcome of the use of these devices is the production of machine readable data. We describe a machine independent format for storing data derived from automatic sequencing instruments. At present, we believe, to get the maximum information from the current machines and to resolve disagreements between readings, it is essential for human users to be able to view the trace representation of the data. In an earlier paper (Dear and Staden, 1991) we described a program that could show the aligned sequences in a contig, and simultaneously display the trace representations of sequences derived from the ABI and Pharmacia machines. The program reads the information directly from the original files produced by the two machines and consequently needs to know how each writes its data. For our purposes these files are unnecessarily large, containing far more data than is needed for the display of traces. In addition new instruments have been developed (for example by Bio-Rad and Millipore) and there is interest in using our software with the data they produce.

In their respective ways the instruments produce measurements of the fluorescence or radioactivity at sample points along the gel or film. When these sample values are plotted for each of the four base types we see the, now familiar, trace representations of the band intensities. The analytical software provided with each instrument interprets the data and assigns base symbols to positions along the sampled points. In theory such software could also attribute a probability or confidence value to each assigned base.

FILE DESIGN In designing the format we wanted to minimise the file size while trying to preserve generality and flexibility. We record the following information.

1. The sample values for each of the four base types. To reduce the file size we use 8 bit integers, and we therefore normalise the values to lie in the range 0 to 255. This is a lower precision than used internally by the instruments but is more than adequate for visual inspection.
2. An index into the sample values for each called base to identify its position.
3. The probability or confidence value for each
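As a rough illustration of item 1, the sketch below scales raw instrument readings into the range 0 to 255. The text does not say how the normalisation is done, so dividing by the largest reading is an assumption, as are the function name `normalise_samples` and the use of 16-bit raw values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: scale raw readings so that the largest becomes 255.
 * The scaling rule (divide by the maximum) is an assumption; the text only
 * says that values are normalised to lie in the range 0 to 255. */
static void normalise_samples(const uint16_t *raw, uint8_t *out, size_t n)
{
    uint16_t max = 0;
    size_t i;

    assert(raw != NULL && out != NULL);

    /* First pass: find the largest raw reading. */
    for (i = 0; i < n; i++) {
        if (raw[i] > max)
            max = raw[i];
    }

    /* Second pass: scale with rounding to nearest. */
    for (i = 0; i < n; i++) {
        if (max == 0)
            out[i] = 0;  /* an all-zero trace stays all zero */
        else
            out[i] = (uint8_t)(((uint32_t)raw[i] * 255u + max / 2u) / max);
    }
}
```

In practice one would probably pass all four traces through with a shared maximum so that their relative heights are preserved, and record the factor used so that the original scale can be recovered.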


1. The Header Record

The file begins with a 128 byte header record that describes the location and size of the chromatogram data in the file. Nothing is implied about the order in which the components (samples, sequence and comments) appear.


/* Type definition for the Header structure */

typedef struct {
    long magic_number;      /* Always ((((('.'<<8)+'s')<<8)+'c')<<8)+'f' */
    long samples;           /* Number of elements in Samples matrix */
    long samples_offset;    /* Byte offset from start of file */
    long bases;             /* Number of bases in Bases matrix */
    long bases_left_clip;   /* Number of bases in left clip (vector) */
    long bases_right_clip;  /* Number of bases in right clip (unreliable) */
    long bases_offset;      /* Byte offset from start of file */
    long comments_size;     /* Number of bytes in Comment section */
    long comments_offset;   /* Byte offset from start of file */
    long spare[23];         /* Unused */
} Header;
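The declaration above yields a 128 byte record only where `long` is 4 bytes; on many current 64-bit platforms `long` is 8 bytes. The following sketch shows one way a reader might decode the header portably from a raw buffer. The names `Header32`, `HEADER_MAGIC`, `read_u32_be` and `decode_header` are hypothetical, and storing each field most significant byte first is an assumption that the text does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HEADER_SIZE 128  /* 32 fields of 4 bytes each */

/* The characters '.', 's', 'c', 'f' packed into 32 bits, as in the
 * header comment. HEADER_MAGIC is a hypothetical name. */
#define HEADER_MAGIC (((((((uint32_t)'.' << 8) + 's') << 8) + 'c') << 8) + 'f')

/* Fixed-width rendition of the Header structure (hypothetical name). */
typedef struct {
    uint32_t magic_number;
    uint32_t samples;
    uint32_t samples_offset;
    uint32_t bases;
    uint32_t bases_left_clip;
    uint32_t bases_right_clip;
    uint32_t bases_offset;
    uint32_t comments_size;
    uint32_t comments_offset;
    uint32_t spare[23];
} Header32;

/* Read a 32-bit value stored most significant byte first.
 * ASSUMPTION: big-endian storage; the text does not specify byte order. */
static uint32_t read_u32_be(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode a 128-byte buffer into h. Returns 1 if the magic number
 * matches, 0 otherwise. */
static int decode_header(const unsigned char *buf, Header32 *h)
{
    size_t i;

    assert(buf != NULL && h != NULL);

    h->magic_number     = read_u32_be(buf + 0);
    h->samples          = read_u32_be(buf + 4);
    h->samples_offset   = read_u32_be(buf + 8);
    h->bases            = read_u32_be(buf + 12);
    h->bases_left_clip  = read_u32_be(buf + 16);
    h->bases_right_clip = read_u32_be(buf + 20);
    h->bases_offset     = read_u32_be(buf + 24);
    h->comments_size    = read_u32_be(buf + 28);
    h->comments_offset  = read_u32_be(buf + 32);
    for (i = 0; i < 23; i++)
        h->spare[i] = read_u32_be(buf + 36 + 4 * i);

    return h->magic_number == HEADER_MAGIC;
}
```

Decoding field by field, rather than copying the buffer over a structure, avoids any dependence on the host's byte order or structure padding.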

2. The Sample Points.

The trace information is stored at byte offset Header.samples_offset from the start of the file. For each sample point there are values for each of the four bases. The scaling factor used can be stored as a comment (see section 4).

/* Type definition for the Sample data */

typedef unsigned char byte;

typedef struct {
    byte sample_A;  /* Sample for A trace */
    byte sample_C;  /* Sample for C trace */
    byte sample_G;  /* Sample for G trace */
    byte sample_T;  /* Sample for T trace */
} Samples;

3. The Sequence Information

typedef unsigned char byte;

typedef struct {
    long peak_index;  /* Index into Samples matrix for base position */
    byte prob_A;      /* Probability of it being an A */
    byte prob_C;      /* Probability of it being a C */
    byte prob_G;      /* Probability of it being a G */
    byte prob_T;      /* Probability of it being a T */
    char base;        /* Called base character */
    byte spare[3];    /* Spare */
} Base;
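To show how the two structures relate, the sketch below uses the peak index of a called base to look up the corresponding trace value in the sample data. It is only an illustration: the function name `peak_height` is hypothetical, a 32-bit unsigned type stands in for `long`, and it assumes bases are recorded as upper-case 'A', 'C', 'G' or 'T'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef unsigned char byte;

typedef struct {
    byte sample_A;  /* Sample for A trace */
    byte sample_C;  /* Sample for C trace */
    byte sample_G;  /* Sample for G trace */
    byte sample_T;  /* Sample for T trace */
} Samples;

typedef struct {
    uint32_t peak_index;  /* Index into Samples matrix (32-bit stand-in for long) */
    byte prob_A;
    byte prob_C;
    byte prob_G;
    byte prob_T;
    char base;            /* Called base character */
    byte spare[3];
} Base;

/* Hypothetical helper: return the trace value (0..255) of the called base
 * type at its peak position, or -1 if the index is out of range or the
 * base character is not one of A, C, G, T (an assumed alphabet). */
static int peak_height(const Samples *samples, size_t n_samples, const Base *b)
{
    const Samples *s;

    assert(samples != NULL && b != NULL);

    if (b->peak_index >= n_samples)
        return -1;

    s = &samples[b->peak_index];
    switch (b->base) {
    case 'A': return s->sample_A;
    case 'C': return s->sample_C;
    case 'G': return s->sample_G;
    case 'T': return s->sample_T;
    default:  return -1;
    }
}
```

A trace viewer could use the same lookup to place each base letter above its peak when drawing the four traces.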

4. Comments.

Comments are stored at offset Header.comments_offset from the start of the file. All other information is stored in a section of the file as a zero-terminated ASCII string, with entries separated by newline ('\n') characters. Each line has the format: Field-ID=Value. The character string "Field-ID" can be any string, though several have special meaning and should be used. For example, the comments section must include the following fields:

ID
MACH
TPSW
BCSW
DATF
DATN
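Because the comments are a single zero-terminated string of Field-ID=Value lines, a reader can retrieve a field with a simple scan. The sketch below is one possible approach; the function name `find_comment` is hypothetical, and the field values used in the usage note are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: search a zero-terminated comments string made of
 * newline-separated "Field-ID=Value" lines for the field `id`.
 * On success copies the value into `out` (truncated to fit, always
 * zero-terminated) and returns 1; returns 0 if the field is absent. */
static int find_comment(const char *comments, const char *id,
                        char *out, size_t out_size)
{
    size_t id_len;
    const char *line;

    assert(comments != NULL && id != NULL && out != NULL && out_size > 0);

    id_len = strlen(id);
    line = comments;

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t line_len = end ? (size_t)(end - line) : strlen(line);

        /* The line must be longer than the ID so that '=' can follow it. */
        if (line_len > id_len &&
            strncmp(line, id, id_len) == 0 &&
            line[id_len] == '=') {
            size_t val_len = line_len - id_len - 1;
            if (val_len >= out_size)
                val_len = out_size - 1;
            memcpy(out, line + id_len + 1, val_len);
            out[val_len] = '\0';
            return 1;
        }

        /* Advance to the start of the next line, if any. */
        line += line_len;
        if (*line == '\n')
            line++;
    }
    return 0;
}
```

For instance, given the invented comments string "MACH=ABI 373A\nDATF=ABI\n", a call with the ID "MACH" would copy "ABI 373A" into the output buffer, while a call with "BCSW" would return 0.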
