MIQAS_TAB—Tab-delimited files describing a QTL or association study

version 1.0 (June 2008)

Introduction

The MIQAS_TAB set of file formats are designed to describe the results from a QTL or association study according to MIQAS minimal requirements. These files are all tab-delimited and provide an alternative solution for data exchange from MIQAS_XML, the XMLSchema provided with the MIQAS standard. Even though the information described here could be entered in different sheets of a spreadsheet application like Microsoft Excel or OpenOffice, the MIQAS_TAB specification explicitly refers to tab-delimited text files. One of the main reasons is the fact that spreadsheet programs can alter data in unexpected ways.

The different sections of the MIQAS standard are covered by one or more tab-delimited files. The main file, called experiment_description_file.txt, provides meta-data for the study and links the rest of the data into a coherent whole. When one submits data using MIQAS_TAB, the researcher sends a group of files and indicates which of these is the experiment description file. A system loading this data will use the information in that file to locate the rest of the data. An overview of the conceptual parts covered by the MIQAS_TAB files is presented in Figure 1.

The following files are used:

experiment description file: Contains meta-information and provides links to the rest of the data.
population description file: Describes the population used for the analysis.
pedigree file: Contains sex and pedigree for the individuals.
trait description file: Describes the trait(s) under investigation. More than one trait can be covered in a single file.
marker description file: Provides accession numbers and possible alleles for all markers.
map file: Provides the genetic map for the markers as generated on the study sample.
genotype data file: Contains the genotypes for all markers vs individuals.
phenotype data file: Contains the phenotypes for all traits vs individuals.
results file: Contains specifics for the analysis and location of QTLs and associated markers.

Some of these files are always required, while others are optional under certain circumstances. See the documentation for the experiment description file (below) for more information.

There are two alternative formats for these files: either they contain two columns (i.e. tag-value pairs) or are a grid of data.

The id field mentioned in several of the files acts as an internal reference within the group of files. No additional meaning should be attributed to it.

1. Experiment description file (required)

The experiment description file contains meta-data for the study and links to the rest of the data. Rows in this file normally contain 2 columns (with one exception), following a tag-value paradigm: the first column provides a tag while the second column is the value for that tag.

Data provided in this file:

MIQAS specification version (required): The version of the specification used (which is displayed at the top of this document).
title (required): A title describing the study in one sentence.
description (optional): A more verbose description of the study.
contact email (required): The email address of a contact person for this study.
contact last name (optional)
contact first name (optional)
contact phone (optional)
contact address (optional)
contact affiliation (optional)
reference PMID (optional)
reference authors (optional)
reference title (optional)
reference journal (optional)
reference issue (optional)
reference year (optional)
reference pages (optional)
related result (optional; might be more than one)
population description file (required)
pedigree file (required if genotypes and/or phenotypes are provided)
trait description file (required)
marker description file (required)
genotype data file (optional)
phenotype data file (optional)
map file (required if QTL study)
results file (required; might be more than one)

At minimum, the specification version, a title and a contact email have to be provided, as well as names for some of the other files. Entries for related result contain an extra column indicating the type of relationship.

Files that must be provided are the experiment description file, the population description file, the trait file and the marker file. Some files are mandatory under certain circumstances: the qtl file and map file are mandatory if the experiment describes a QTL study. The association file is mandatory if the experiment describes an association study. In addition, the pedigree file is mandatory if either the genotype or phenotype file is present.

Typical issues with this file:

One of the mandatory fields is not provided.
One of the files mentioned are not present.

Example

specification version	1.0
title	Roslin Institute QTL study 123
description	this is what the study is about
contact last name	Aerts
contact first name	Jan
contact email	jan.aerts@myisp.com
reference PMID	12345678
related result	ArkDB\|ARKQTL00000001	fine_mapping
population description file	population.txt
pedigree file	pedigree.txt
trait description file	traits.txt
marker description file	markers.txt
genotype data file	genotypes.txt
phenotype data file	phenotypes.txt
map file	map.txt
results file	results_first_analysis.txt
results file	results_second_analysis.txt

2. Population description file (required)

The population description file contains information of the population. Like the experiment description file, the population description file contains two columns with tag-value pairs.

Data provided in this file:

species (required): The population species.
description (required): A verbose description of the population.
source (required): The source of the population.
structure (required): The structure of the relationships between individuals.

Example

species	Bos taurus
description	BovGen – cross between extreme phenotypes for response for bacterial infection
source	Holstein x Belgian Blue
structure	intercross

3. Pedigree file (conditional)

The pedigree file contains information on the sex of the individuals and the relationships between them. This file is grid-based. The four columns contain the individual id (as used in the genotype and phenotype data files), the sex, the id of the father and the id of the mother. Only the first column is mandatory. The first row contains the headers: ”#individual_id”, “sex”, “father_id” and “mother_id”. Values for sex can be “UNKNOWN”, “MALE”, “FEMALE” or “MALE&FEMALE” (e.g. for oyster). Unknown values should be reported as “UNKNOWN”. Both the father_id and mother_id must be values that are also present in the individual_id column.

This file is mandatory if genotypes or phenotypes are provided (even if it just would contain a list of IDs). If neither the genotypes and phenotypes file is given, the pedigree file is still recommended.

Typical issues with this file:

The header row is missing or contains wrong terms.
The header row does not start with a hash (#).
The individual_id column is missing.
Sex is anything else than “MALE”, “FEMALE”, “UNKNOWN” or “MALE&FEMALE”.
Value for father_id or mother_id is not UNKNOWN or an ID that is present in the individual_id column.
Inconsistencies in sex, e.g. where the father_id refers to an individual that is female.

Example

#individual_id	sex	father_id	mother_id
ind_1	MALE	UNKNOWN	UNKNOWN
ind_2	FEMALE	UNKNOWN	UNKNOWN
ind_3	MALE	ind_1	ind_2
ind_4	MALE	ind_3	ind_2
ind_5	FEMALE	ind_3	ind_2

4. Trait description file (required)

The trait description file contains all information necessary to describe the traits under investigation. Like the experiment description file, the trait description file contains two columns with tag-value pairs. More than one trait can be described by separating them with a blank line.

Data provided in this file (for each trait):

id (required): A unique identifier to be used to refer to this trait in other files. This ID must not contain spaces.
name (required): A short name to describe the trait.
description (optional): A verbose description of the trait.
ontology (optional; might be more than one): Ontology terms describing the trait. This should be the authority (e.g. “PATO”, Plant and Animal Trait Ontology), followed by a vertical line and the accession number for the term.
unit (required): The unit in which the phenotypic data are reported. This is important to interpret the data in the phenotype data file.

Typical issues with this file:

The id field contains spaces.
Data for different traits are not separated by an empty line.
A mandatory field is not provided.
The ontology term is not of the format “authority|accession”.

Example

id	ISU1
name	fat depth
description	thickness of the fat in the back of the animal
ontology	PATO\|acc0000001
ontology	Evo\|acc999999
unit	centimeter

id	BW
name	birth weight
description	weight at birth
unit	grams

id	carcass_weight
name	carcass weight
description	weight of the animal after slaughter
unit	grams

Three traits are listed in this example. It is made clear that the id for each of them can be any string and has no other meaning than to be a unique identifier that can be reference in the other MIQAS_TAB files, in particular the phenotypes and results files.

5. Marker description file (required)

The marker description file contains all information necessary to uniquely identify the genetic markers used in the study. Like the previous two files, this file has two columns. Information for different markers is separated by an empty line.

Data provided in this file:

id (required): A unique identifier to refer to in other files. The ID should not contain spaces.
name (optional; might be more than one): Descriptive name for the marker.
accession (required; might be more than one): Accession number of the marker in a public or private database. The format consists of the authority, followed by a vertical line and the accession number (e.g. “dbSNP|rs0000001”)
allele (optional; will be more than one): Allele code used in the genotype data file. Second column contains the code and the additional third column the value that corresponds to that code.

Typical issues with this file:

The id field contains spaces.
A mandatory field is not provided.
Data for different markers is not separated by an empty line.
The accession is not of the format “authority|accession”.
The same allele code is used more than once for the same marker.

Example

id	CGA
name	CGA
accession	ArkdB\|ARKMKR00001
allele	0	unknown
allele	1	137 bp
allele	2	188 bp
allele	3	deletion

id	S0082
accession	dbSNP\|rs0002
accession	ArkDB\|ARKMKR00002
allele	0	unknown
allele	1	A
allele	2	G

id	S0083
accession	dbSNP\|rs00003
allele	0	unknown
allele	A	A
allele	G	G

id	S0085
accession	dbSNP\|rs00004

id	SW1828
accession	dbSNP\|rs00005

In this example, the first marker will be referred to in the other files as CGA, i.e. the value for the id field. Four different alleles are described. The values 0 to 3 can be used in the genotype data file and refer to the actual alleles as provided in the third column. These values can e.g. be fragment lengths if the marker is a microsatellite. The second marker is a SNP and has three alleles. This marker has been defined in two different databases that each contain information on it. The third example marker shows that anything can be put in the second column of the allele fields: even the actual biological information. This means that the anonymous label “A” in the genotype.txt file for this marker actually means adenosine. Finally, the last two markers do not list any alleles. In this case, the alleles in the genotypes.txt file must be considered as just anonymous labels.

6. Genotype data file

The genotype data file contains the genotypes for all markers for all individuals. The first column contains the individual id (referring to the individual id in the pedigree file) and the other columns contain genotypes for the different markers. The first row is a header row and should consist of ”#individual_id” in the first cell, followed by marker ids. These marker ids should correspond to the id field in the marker description file; if the marker id cannot be found back in that file, the genotype file is invalid. The values for the genotype recordings must be present in the second column of the “allele” fields in the marker description file. The same marker id should appear once for each allele; for diploid organisms, there should therefore be two columns for each marker.

Typical issues with this file:

The header row is missing or contains wrong terms.
The individual_id column is missing.
The header row does not start with a hash (#).
A marker ID is not defined in the marker description file.
An individual ID is not defined in the pedigree file.
Values for the markers are not mentioned as alleles in the marker description file.

Example

#individual_id	CGA	CGA	S0082	S0082	S0155	S0155	SW1301	SW1301
ind_1	1	4	4	7	1	2	1	2
ind_2	1	4	4	3	1	1	1	2
ind_3	0	0	0	0	1	3	1	4
ind_4	1	2	2	2	1	1	2	2
ind_5	1	1	1	2	3	3	2	3

7. Phenotype data file (optional)

The phenotype data file is similar to the genotype data file and contains the values for phenotype recordings of the individuals of the study. This file is also grid-based. The first column contains the individual id (referring to the individual id in the pedigree file) and the other columns contain values for the different traits. The first row is a header row and should consist of ”#individual_id” in the first cell, followed by trait ids. These trait ids should correspond to the id field in the trait description file; if the trait id cannot be found back in the trait description file, the phenotype file is invalid. The values for the phenotype recording must be in the unit as recorded in the trait description file.

Typical issues with this file:

The header row is missing or contains wrong terms.
The individual_id column is missing.
The header row does not start with a hash (#).
A trait ID is not defined in the trait description file.
An individual ID is not defined in the pedigree file.
Values for the traits are not in the units as described in the trait description file.

Example

#individual_id	BW	carcass_weight	ISU1
ind_1	67800	382100	19
ind_2	75400	283100	24
ind_3	69800	235500	29
ind_4	64800	318260	18
ind_5	65800	291700	19

8. Map file (conditional)

If the MIQAS experiment describes a QTL study, the genetic map has to be provided in the map file. This file, in contrast to the previous three files, contains a grid with the first column containing the marker identifiers (i.e. the value for id in the marker description file), the second column contains the linkage group and the third column contains the position in centiMorgan. The first row contains the headers: ”#marker_id”, “linkage_group” and “position”. For each linkage group, there should be a marker at zero centiMorgan.

Typical issues with this file:

The header row is missing or contains the wrong terms.
The header row does not start with a hash (#).
One of the columns is missing.
A marker ID is not defined in the marker description file.
There is a linkage group without a marker at position zero.

Example

#marker_id	linkage_group	position
S0083	1	0
CGA	1	42
S0082	1	70
S0085	7	0
SW1828	7	106

9. Results file (required)

A results file contains all QTLs or associated markers that are identified within a single analysis. As with other tag-value files described above, different results (i.e. QTLs or associated markers) are separated by an empty line. The first block of rows describes the analysis and ends with a blank line. Subsequent lines report QTLs or associations, each separated with an empty line.

Data provided in this file:

Analysis-related:
- id (required): A unique identifier for the analysis. This ID should not contain spaces.
- trait (required; might be more than one): The traits included in the analysis. The value should be defined as an id in the traits file.
- genetic model (required): The genetic model under which the analysis was performed (e.g. “dominant effect” or “imprinting effect”)
- software (required): The software tool used to do the analysis.
- approach (optional): The approach used in the analysis (e.g. “least square regression interval mapping”)
- assumption (optional; might be more than one): Any assumptions taken into consideration.
- fixed effect (optional; might be more than one): Any fixed effect taken into consideration.
- trait covariate (optional; might be more than one): Any covariate taken into consideration. Value should be defined as an id in the traits file.
- qtl covariate (optional; might be more than one): Any QTL covariate taken into consideration. Value should be defined as an id for a QTL in the results files.
In case of QTL study, for every QTL:
- qtl id (required): A unique ID for the QTL. This ID should not contain spaces. Caution: these have to be unique over the different results files.
- trait (required): Trait that this QTL applies to. The value should be defined as an id in the traits file.
- linkage group (required): The linkage group the QTL was found on. This should be one of the linkage groups mentioned in the map file.
- region start (required): Left boundary of the QTL, in cM.
- region stop (required): Right boundary of the QTL, in cM.
- region peak (optional): Peak position in cM.
- confidence interval start (optional): Left boundary of the QTL confidence interval, in cM.
- confidence interval stop (optional): Right boundary of the QTL confidence interval, in cM.
- confidence interval method (required if confidence interval provided): Method used to calculate confidence interval.
- statistical significance (required): Statistical significance of the result.
- proportion explained (required): The proportion of the phenotype that is explained by this QTL, in percent.
- significance term (optional): One of the terms “chromosome-wide suggestive”, “chromosome-wide significant”, “genome-wide suggestive” and “genome-wide significant”.
- significance criterion (required if significance term provided): The criterion used to determine the significance term.
In case of assocation study, for every associated marker:
- association id (required): A unique ID for the association. This ID should not contain spaces. Caution: these have to be unique over the different results files.
- trait (required): Trait that this association applies to. The value should be defined as an id in the traits file.
- associated marker (required): ID of the associated marker. This ID should correspond to one of the markers in the marker description file.
- statistical significance (required): Statistical significance of this association.
- proportion explained (required): The proportion of the phenotype that is explained by this marker, in percent.

Typical issues with this file:

The analysis ID, QTL ID or association ID contains spaces.
A QTL or assocation ID is not unique over all results files.
The first set of lines (above the first empty line) does not describe the analysis.
The trait is not identifiable in the trait file.
QTL or association IDs are not unique over the different results files.
A mandatory field is not provided.
Fixed effect or covariate not one of the allowed values (see below).
Data for different QTLs or associations is not separated by an empty row.
Trait mentioned for a QTL or association is not mentioned in the analysis section at the top of the file.
For QTLs:
- The linkage group for a QTL is not present in the map data file.
- The region stop position is outside of the range defined in the map data file.
- The region peak is not between the region start and region stop.
- The confidence interval stop position is outside of the range defined in the map data file.
- Significance term not one of the four allowed terms.
For associations:
- Value for associated marker is not an ID that is defined in the marker description file.

Example

Results file A

id	results_a
trait	BW
trait	carcass_weight
genetic_model	additive
software	GridQTL
approach	least square regression interval mapping
assumption	effect same in males as in females
fixed effect	sex
trait covariate	ISU1

qtl id	qtl_1
trait	BW
linkage group	1
region start	7.2
region stop	9.3
region peak	8.1
confidence interval start	5.2
confidence interval stop	13.1
confidence interval method	bootstrapping
statistical significance	P<0.05
proportion explained	19.1%
significance term	genome-wide significant
significance criterion	LOD > 3

qtl id	qtl_2
trait	BW
linkage group	1
region start	19.2
region stop	23.3
region peak	19.9
confidence interval start	15
confidence interval stop	28
confidence interval method	bootstrapping
statistical significance	P<0.05
proportion explained	5.4%
significance term	chromosome-wide suggestive
significance criterion	LOD > 3

Results file B

id	results_b
trait	BW
trait	carcass_weight
genetic model	additive
software	GridQTL
approach	least square regression interval mapping
fixed effect	sex
trait covariate	ISU1

qtl id	qtl_3
trait	BW
linkage group	1
region start	5.4
region stop	11.2
region peak	8.6
confidence interval start	3.1
confidence interval stop	14.2
confidence interval method	bootstrapping
statistical significance	P<0.05
proportion explained	14.1%
significance term	genome-wide significant
significance criterion	LOD > 3

Three QTLs are described in this example, separated into two files. The analysis for both files is exactly the same, except that the first analysis (in results_a) uses an assumption that is not present in the second analysis. Therefore, these must be separated into different files. The analysis part of these files should contain all traits that are included in the analysis, even if no QTLs or associated markers could be identified for them. Assumptions should be very short (e.g. “effect same for male and female”) and not contain multiple assumptions in one row. Multiple assumptions should be split over multiple rows.

Values for fixed effects and trait covariates are either (a) traits that are described in the trait description file (and linked through the id field), or (b) any header in the pedigree file (i.e. sex, father_id or mother_id). Values for QTL covariates have to be QTL IDs as specified in the trait description file.

Figure legends

Figure 1: Overview of experiment design that is covered by the MIQAS_TAB specification.