MIQAS_TAB—Tab-delimited files describing a QTL or association study

version 1.0 (June 2008)

Introduction

The MIQAS_TAB set of file formats are designed to describe the results from a QTL or association study according to MIQAS minimal requirements. These files are all tab-delimited and provide an alternative solution for data exchange from MIQAS_XML, the XMLSchema provided with the MIQAS standard. Even though the information described here could be entered in different sheets of a spreadsheet application like Microsoft Excel or OpenOffice, the MIQAS_TAB specification explicitly refers to tab-delimited text files. One of the main reasons is the fact that spreadsheet programs can alter data in unexpected ways.

The different sections of the MIQAS standard are covered by one or more tab-delimited files. The main file, called experiment_description_file.txt, provides meta-data for the study and links the rest of the data into a coherent whole. When one submits data using MIQAS_TAB, the researcher sends a group of files and indicates which of these is the experiment description file. A system loading this data will use the information in that file to locate the rest of the data. An overview of the conceptual parts covered by the MIQAS_TAB files is presented in Figure 1.

The following files are used:

Some of these files are always required, while others are optional under certain circumstances. See the documentation for the experiment description file (below) for more information.

There are two alternative formats for these files: either they contain two columns (i.e. tag-value pairs) or are a grid of data.

The id field mentioned in several of the files acts as an internal reference within the group of files. No additional meaning should be attributed to it.


1. Experiment description file (required)

The experiment description file contains meta-data for the study and links to the rest of the data. Rows in this file normally contain 2 columns (with one exception), following a tag-value paradigm: the first column provides a tag while the second column is the value for that tag.

Data provided in this file:

At minimum, the specification version, a title and a contact email have to be provided, as well as names for some of the other files. Entries for related result contain an extra column indicating the type of relationship.

Files that must be provided are the experiment description file, the population description file, the trait file and the marker file. Some files are mandatory under certain circumstances: the qtl file and map file are mandatory if the experiment describes a QTL study. The association file is mandatory if the experiment describes an association study. In addition, the pedigree file is mandatory if either the genotype or phenotype file is present.

Typical issues with this file:

Example

specification version 1.0  
title Roslin Institute QTL study 123  
description this is what the study is about  
contact last name Aerts  
contact first name Jan  
contact email jan.aerts@myisp.com  
reference PMID 12345678  
related result ArkDB|ARKQTL00000001 fine_mapping
population description file population.txt  
pedigree file pedigree.txt  
trait description file traits.txt  
marker description file markers.txt  
genotype data file genotypes.txt  
phenotype data file phenotypes.txt  
map file map.txt  
results file results_first_analysis.txt  
results file results_second_analysis.txt  

2. Population description file (required)

The population description file contains information of the population. Like the experiment description file, the population description file contains two columns with tag-value pairs.

Data provided in this file:

Example

species Bos taurus
description BovGen – cross between extreme phenotypes for response for bacterial infection
source Holstein x Belgian Blue
structure intercross

3. Pedigree file (conditional)

The pedigree file contains information on the sex of the individuals and the relationships between them. This file is grid-based. The four columns contain the individual id (as used in the genotype and phenotype data files), the sex, the id of the father and the id of the mother. Only the first column is mandatory. The first row contains the headers: ”#individual_id”, “sex”, “father_id” and “mother_id”. Values for sex can be “UNKNOWN”, “MALE”, “FEMALE” or “MALE&FEMALE” (e.g. for oyster). Unknown values should be reported as “UNKNOWN”. Both the father_id and mother_id must be values that are also present in the individual_id column.

This file is mandatory if genotypes or phenotypes are provided (even if it just would contain a list of IDs). If neither the genotypes and phenotypes file is given, the pedigree file is still recommended.

Typical issues with this file:

Example

#individual_id sex father_id mother_id
ind_1 MALE UNKNOWN UNKNOWN
ind_2 FEMALE UNKNOWN UNKNOWN
ind_3 MALE ind_1 ind_2
ind_4 MALE ind_3 ind_2
ind_5 FEMALE ind_3 ind_2

4. Trait description file (required)

The trait description file contains all information necessary to describe the traits under investigation. Like the experiment description file, the trait description file contains two columns with tag-value pairs. More than one trait can be described by separating them with a blank line.

Data provided in this file (for each trait):

Typical issues with this file:

Example

id ISU1
name fat depth
description thickness of the fat in the back of the animal
ontology PATO|acc0000001
ontology Evo|acc999999
unit centimeter
   
id BW
name birth weight
description weight at birth
unit grams
   
id carcass_weight
name carcass weight
description weight of the animal after slaughter
unit grams

Three traits are listed in this example. It is made clear that the id for each of them can be any string and has no other meaning than to be a unique identifier that can be reference in the other MIQAS_TAB files, in particular the phenotypes and results files.


5. Marker description file (required)

The marker description file contains all information necessary to uniquely identify the genetic markers used in the study. Like the previous two files, this file has two columns. Information for different markers is separated by an empty line.

Data provided in this file:

Typical issues with this file:

Example

id CGA  
name CGA  
accession ArkdB|ARKMKR00001  
allele 0 unknown
allele 1 137 bp
allele 2 188 bp
allele 3 deletion
     
id S0082  
accession dbSNP|rs0002  
accession ArkDB|ARKMKR00002  
allele 0 unknown
allele 1 A
allele 2 G
     
id S0083  
accession dbSNP|rs00003  
allele 0 unknown
allele A A
allele G G
     
id S0085  
accession dbSNP|rs00004  
     
id SW1828  
accession dbSNP|rs00005  

In this example, the first marker will be referred to in the other files as CGA, i.e. the value for the id field. Four different alleles are described. The values 0 to 3 can be used in the genotype data file and refer to the actual alleles as provided in the third column. These values can e.g. be fragment lengths if the marker is a microsatellite. The second marker is a SNP and has three alleles. This marker has been defined in two different databases that each contain information on it. The third example marker shows that anything can be put in the second column of the allele fields: even the actual biological information. This means that the anonymous label “A” in the genotype.txt file for this marker actually means adenosine. Finally, the last two markers do not list any alleles. In this case, the alleles in the genotypes.txt file must be considered as just anonymous labels.


6. Genotype data file

The genotype data file contains the genotypes for all markers for all individuals. The first column contains the individual id (referring to the individual id in the pedigree file) and the other columns contain genotypes for the different markers. The first row is a header row and should consist of ”#individual_id” in the first cell, followed by marker ids. These marker ids should correspond to the id field in the marker description file; if the marker id cannot be found back in that file, the genotype file is invalid. The values for the genotype recordings must be present in the second column of the “allele” fields in the marker description file. The same marker id should appear once for each allele; for diploid organisms, there should therefore be two columns for each marker.

Typical issues with this file:

Example

#individual_id CGA CGA S0082 S0082 S0155 S0155 SW1301 SW1301
ind_1 1 4 4 7 1 2 1 2
ind_2 1 4 4 3 1 1 1 2
ind_3 0 0 0 0 1 3 1 4
ind_4 1 2 2 2 1 1 2 2
ind_5 1 1 1 2 3 3 2 3

7. Phenotype data file (optional)

The phenotype data file is similar to the genotype data file and contains the values for phenotype recordings of the individuals of the study. This file is also grid-based. The first column contains the individual id (referring to the individual id in the pedigree file) and the other columns contain values for the different traits. The first row is a header row and should consist of ”#individual_id” in the first cell, followed by trait ids. These trait ids should correspond to the id field in the trait description file; if the trait id cannot be found back in the trait description file, the phenotype file is invalid. The values for the phenotype recording must be in the unit as recorded in the trait description file.

Typical issues with this file:

Example

#individual_id BW carcass_weight ISU1
ind_1 67800 382100 19
ind_2 75400 283100 24
ind_3 69800 235500 29
ind_4 64800 318260 18
ind_5 65800 291700 19

8. Map file (conditional)

If the MIQAS experiment describes a QTL study, the genetic map has to be provided in the map file. This file, in contrast to the previous three files, contains a grid with the first column containing the marker identifiers (i.e. the value for id in the marker description file), the second column contains the linkage group and the third column contains the position in centiMorgan. The first row contains the headers: ”#marker_id”, “linkage_group” and “position”. For each linkage group, there should be a marker at zero centiMorgan.

Typical issues with this file:

Example

#marker_id linkage_group position
S0083 1 0
CGA 1 42
S0082 1 70
S0085 7 0
SW1828 7 106

9. Results file (required)

A results file contains all QTLs or associated markers that are identified within a single analysis. As with other tag-value files described above, different results (i.e. QTLs or associated markers) are separated by an empty line. The first block of rows describes the analysis and ends with a blank line. Subsequent lines report QTLs or associations, each separated with an empty line.

Data provided in this file:

Typical issues with this file:

Example

Results file A

id results_a
trait BW
trait carcass_weight
genetic_model additive
software GridQTL
approach least square regression interval mapping
assumption effect same in males as in females
fixed effect sex
trait covariate ISU1
   
qtl id qtl_1
trait BW
linkage group 1
region start 7.2
region stop 9.3
region peak 8.1
confidence interval start 5.2
confidence interval stop 13.1
confidence interval method bootstrapping
statistical significance P<0.05
proportion explained 19.1%
significance term genome-wide significant
significance criterion LOD > 3
   
qtl id qtl_2
trait BW
linkage group 1
region start 19.2
region stop 23.3
region peak 19.9
confidence interval start 15
confidence interval stop 28
confidence interval method bootstrapping
statistical significance P<0.05
proportion explained 5.4%
significance term chromosome-wide suggestive
significance criterion LOD > 3

Results file B

id results_b
trait BW
trait carcass_weight
genetic model additive
software GridQTL
approach least square regression interval mapping
fixed effect sex
trait covariate ISU1
   
qtl id qtl_3
trait BW
linkage group 1
region start 5.4
region stop 11.2
region peak 8.6
confidence interval start 3.1
confidence interval stop 14.2
confidence interval method bootstrapping
statistical significance P<0.05
proportion explained 14.1%
significance term genome-wide significant
significance criterion LOD > 3

Three QTLs are described in this example, separated into two files. The analysis for both files is exactly the same, except that the first analysis (in results_a) uses an assumption that is not present in the second analysis. Therefore, these must be separated into different files. The analysis part of these files should contain all traits that are included in the analysis, even if no QTLs or associated markers could be identified for them. Assumptions should be very short (e.g. “effect same for male and female”) and not contain multiple assumptions in one row. Multiple assumptions should be split over multiple rows.

Values for fixed effects and trait covariates are either (a) traits that are described in the trait description file (and linked through the id field), or (b) any header in the pedigree file (i.e. sex, father_id or mother_id). Values for QTL covariates have to be QTL IDs as specified in the trait description file.


Figure legends

Figure 1: Overview of experiment design that is covered by the MIQAS_TAB specification.