High Throughput Biology
Description of Data Formats Supported by BioIDE
 

Modified Linkage Format

The Linkage format is a widely used format for representing genetic data. In its official form it covers four different files containing marker map information, pedigree information, genotype information, and locus information. With SNP data, however, frequently only two files are required: one containing marker information (the map file) and the other containing phenotypes and genotypes (the data file). The original Linkage format holds only limited information on markers and on phenotypes.

BioIDE supports the original Linkage format with one map file and one data file. However, to accommodate researchers who have collected more information on markers (such as gene information and multiple mapping information) and/or more phenotype information on individuals, the Linkage format supported by BioIDE is extended to allow an unlimited number of columns in both the map file and the data file.

  • Map File:
    • With a tab or space delimited header line describing the columns in the map file
    • The header line must contain the following words
      • MarkerID
      • Chromosome
      • Position1
    • The header line can contain any other words as column headers. However to allow the system to correctly interpret the columns, the following controlled vocabulary should be used:
      • MarkerType: describes the type of markers such as SNP or Microsatellite
      • MapName1: describes the name of the map or the first map
      • MapUnit1: describes the map unit for the first map
      • EndPosition1: describes the end position when the map gives a chromosomal fragment, in which case Position1 defines the starting position and EndPosition1 defines the ending position
      • MapName2: describes the name of the second map
      • MapUnit2: describes the map unit for the second map
      • Position2 (if this file contains two maps): describes the position or starting position of the marker in the second map
      • EndPosition2: describes the ending position of the marker in the second map
    • The words in the header line can be in any order

  • Data File:
    • With a tab or space delimited header line describing the columns in the data file
    • The header line must contain the following words
      • IndID
      • Affected
      • Genotype
    • The header line can contain any other words as column headers. However to allow the system to correctly interpret the columns, the following controlled vocabulary should be used if applicable:
      • PedigreeID: family ID
      • FatherID: IndID for father
      • MotherID: IndID for mother
      • Gender: gender for individual
    • The words in the header line can be in any order except for “Genotype”, which has to be the last word
    • In the data rows following the header line, genotype data for each marker in the Map File are represented by two columns of allele values separated by space
    • In the data rows, any missing allele value has to be entered as “0”. For example, if one allele is missing and the other allele has a value of “1”, then the two columns for that marker for that individual will be “1 0”.
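
The header rules above can be sketched in a few lines of Python. This is a minimal illustration, not BioIDE's own loader: the function names and the sample marker and individual IDs are hypothetical, the value "2" in the Affected column is arbitrary, phenotype values are assumed to contain no spaces, and the allele columns are assumed to follow the row order of the map file.

```python
REQUIRED_MAP_WORDS = {"MarkerID", "Chromosome", "Position1"}
REQUIRED_DATA_WORDS = {"IndID", "Affected", "Genotype"}


def read_header(line, required):
    """Split a tab- or space-delimited header line and check its required words."""
    words = line.split()
    missing = required - set(words)
    if missing:
        raise ValueError("missing header words: " + ", ".join(sorted(missing)))
    return words


def read_map(lines):
    """Return one dict per marker, keyed by the header words (any order)."""
    words = read_header(lines[0], REQUIRED_MAP_WORDS)
    return [dict(zip(words, line.split())) for line in lines[1:] if line.strip()]


def read_data(lines, marker_ids):
    """Return (phenotypes, genotypes) per individual.

    genotypes maps each marker ID to its pair of allele values; "0" means missing.
    """
    words = read_header(lines[0], REQUIRED_DATA_WORDS)
    if words[-1] != "Genotype":
        raise ValueError('"Genotype" has to be the last header word')
    phenotype_names = words[:-1]
    individuals = []
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue
        alleles = fields[len(phenotype_names):]
        if len(alleles) != 2 * len(marker_ids):
            raise ValueError("expected two allele columns per marker")
        phenotypes = dict(zip(phenotype_names, fields))
        genotypes = {
            marker: (alleles[2 * i], alleles[2 * i + 1])
            for i, marker in enumerate(marker_ids)
        }
        individuals.append((phenotypes, genotypes))
    return individuals


# Hypothetical two-marker example; the second marker has one missing allele.
map_lines = ["Chromosome MarkerID Position1 MarkerType", "1 m1 1000 SNP", "1 m2 2500 SNP"]
data_lines = ["PedigreeID IndID Affected Genotype", "fam1 ind1 2 1 2 1 0"]
markers = read_map(map_lines)
individuals = read_data(data_lines, [m["MarkerID"] for m in markers])
```

Keying every row by header word rather than by column position is what lets the header words appear in any order, with only “Genotype” pinned to the end.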

Affymetrix SNP 500K Format

At the moment, .call files and .conf files containing genotypes and quality scores are used for the Affymetrix SNP 500K format. More than one .call file or .conf file can be uploaded, but the number and the names (before the dot) of the .call files have to match the number and names of the .conf files, respectively.

No map file is necessary for data loading. Map files containing cross references between Affy marker IDs and rs#s can be appended later.

  • .call File:
    • The header line contains the IDs for each individual, separated by tab. The first column is empty.
    • Each subsequent line contains the marker ID and genotypes for that marker ID for all individuals, separated by tab. The genotype is presented as either "AA", "AB", or "BB".
    • Missing allele value is represented by "N"

  • .conf File:
    • The header line contains the IDs for each individual, separated by tab. The first column is empty.
    • Each subsequent line contains the marker ID and quality scores for all individuals, separated by tab. One quality score for each genotype (not allele).
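
As an illustration of how a .call file and its matching .conf file line up, the sketch below reads both as tab-separated matrices and pairs each genotype with its quality score. The function names, marker IDs, individual IDs, and scores are hypothetical, and treating any genotype that contains "N" as missing is an assumption about how missing alleles appear in practice.

```python
def read_matrix(lines):
    """Parse tab-separated lines whose header row has an empty first column."""
    individual_ids = lines[0].rstrip("\n").split("\t")[1:]
    table = {}
    for line in lines[1:]:
        fields = line.rstrip("\n").split("\t")
        table[fields[0]] = dict(zip(individual_ids, fields[1:]))
    return table


def merge_calls(calls, confs):
    """Pair every genotype with its quality score; genotypes containing N become None."""
    merged = {}
    for marker, per_individual in calls.items():
        merged[marker] = {
            ind: (None if "N" in genotype else genotype, float(confs[marker][ind]))
            for ind, genotype in per_individual.items()
        }
    return merged


# Hypothetical contents of one .call file and the .conf file with the same name.
call_lines = ["\tind1\tind2", "marker1\tAA\tAB", "marker2\tBB\tNN"]
conf_lines = ["\tind1\tind2", "marker1\t0.01\t0.02", "marker2\t0.03\t0.9"]
merged = merge_calls(read_matrix(call_lines), read_matrix(conf_lines))
```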

Affymetrix SNP 6 Format

At the moment, .call files and .conf files containing genotypes and quality scores are used for the Affymetrix SNP 6 format. More than one .call file or .conf file can be uploaded, but the number and the names (before the dot) of the .call files have to match the number and names of the .conf files, respectively.

No map file is necessary for data loading. Map files containing cross references between Affy marker IDs and rs#s can be appended later.

  • .call File:
    • There are many header lines in this file. The header line above the actual genotype data contains the file names for each .CEL file, serving as individual IDs.
    • Each subsequent line contains the marker ID and genotypes for that marker ID for all individuals, separated by tab. The genotype is presented as either "0", "1", or "2". "0" represents "AA", "1" represents "AB", "2" represents "BB".
    • Missing genotype value is represented by "-1"

  • .conf File:
    • There are many header lines in this file. The header line above the actual genotype data contains the file names for each .CEL file, serving as individual IDs.
    • Each subsequent line contains the marker ID and quality scores for all individuals, separated by tab. One quality score for each genotype.
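
The numeric genotype codes of the SNP 6 .call file can be translated with a small lookup table. The sketch below decodes a single data line; the function name and marker ID are hypothetical, and skipping the leading header lines is left out because the text above does not say how they are marked.

```python
# Code table taken from the description above; -1 marks a missing genotype.
SNP6_CODES = {"0": "AA", "1": "AB", "2": "BB", "-1": None}


def decode_snp6_row(line):
    """Return (marker ID, list of genotypes) for one tab-separated data line."""
    fields = line.rstrip("\n").split("\t")
    return fields[0], [SNP6_CODES[code] for code in fields[1:]]


marker_id, genotypes = decode_snp6_row("marker1\t0\t2\t-1\t1")
```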

Illumina Format

Currently the matrix format produced by Illumina BeadStudio is supported. In this format, all data are stored in a single file as a two-dimensional matrix. After a few header lines, the individual IDs are listed on a single line separated by tab. After that line, genotype data for each marker is organized as one line per marker with the marker IDs (rs#) as the first column. For each individual on each marker, the genotype represented by two alleles is separated from its quality score by a vertical divider.
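
A data line of this matrix could be split as sketched below. The sketch assumes that the vertical divider is the "|" character and that the quality score is numeric; the function name, IDs, and values are made up for illustration.

```python
def parse_illumina_row(line, individual_ids, divider="|"):
    """Return (marker ID, {individual ID: (genotype, quality score)}) for one line."""
    fields = line.rstrip("\n").split("\t")
    marker_id, cells = fields[0], fields[1:]
    if len(cells) != len(individual_ids):
        raise ValueError("one cell per individual expected")
    result = {}
    for individual, cell in zip(individual_ids, cells):
        genotype, score = cell.split(divider)
        result[individual] = (genotype, float(score))
    return marker_id, result


marker_id, calls = parse_illumina_row("rs1\tAG|0.95\tGG|0.8", ["ind1", "ind2"])
```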


PLINK Format

Details on the PLINK format can be found at the PLINK website. Briefly, it includes two files, one .ped file containing genotype and limited phenotype information, and one .map file containing mapping information. The .ped file requires six phenotype columns, and the .map file requires four columns exactly.


Haploview Format

Details on the Haploview format can be found at the Haploview website. Haploview supports many file formats. However, the Haploview format used most often is its Linkage format, in which a data file and a map file are sufficient. The data file is in the pre-MAKEPED format, mandating six phenotype columns in fixed order before genotypes are presented. The map file requires two columns only, one for marker ID and the other for map location.


FBAT Format

Details on the FBAT format can be found at the FBAT website. Briefly, the FBAT format includes one mandatory pedigree file and an optional phenotype file.

  • Pedigree File:
    • The first line lists the names of the markers
    • The remaining lines contain 6 standard phenotypes and marker genotypes in the following order:
      • pid: pedigree ID
      • Id: individual ID
      • Fid: Father ID (use 0 for founders or marry-ins)
      • Mid: Mother ID (use 0 for founders or marry-ins)
      • Sex: 1= male, 2= female
      • Aff: affection status (2 = affected, 1 = unaffected, 0 = unknown)
      • genotypes: two alleles for each marker. Use 0 for missing alleles
    • All IDs and marker names are strings of any characters other than blank space, tab, newline, and carriage return.
    • The maximum lengths for IDs and marker names are 16 and 64 characters, respectively.
    • A maximum of 40 alleles is allowed for each marker.

  • Phenotype Data File (optional):
    • The first line lists names of all traits in the phenotype file
    • The remaining lines start with "pid" and "id", followed by values of the traits
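
The pedigree file layout described above can be read as sketched below. This is a hedged example rather than FBAT's own parser: the function name and sample values are hypothetical, and fields are assumed to be separated by spaces or tabs.

```python
def parse_fbat_pedigree(lines):
    """Return (marker names, list of individuals) from FBAT pedigree file lines."""
    marker_names = lines[0].split()
    people = []
    for line in lines[1:]:
        fields = line.split()
        if not fields:
            continue
        pid, ind_id, fid, mid, sex, aff = fields[:6]
        alleles = fields[6:]
        if len(alleles) != 2 * len(marker_names):
            raise ValueError("expected two alleles per marker")
        people.append({
            "pid": pid, "id": ind_id, "fid": fid, "mid": mid,
            "sex": int(sex),  # 1 = male, 2 = female
            "aff": int(aff),  # 2 = affected, 1 = unaffected, 0 = unknown
            "genotypes": {
                name: (alleles[2 * i], alleles[2 * i + 1])  # "0" = missing allele
                for i, name in enumerate(marker_names)
            },
        })
    return marker_names, people


# Hypothetical trio member: a founder (father and mother IDs are 0).
names, people = parse_fbat_pedigree(["m1 m2", "fam1 ind1 0 0 1 2 1 2 0 0"])
```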

QTDT Format

Details on the QTDT format can be found at the QTDT website.


fastPHASE Format

Details on the fastPHASE format can be found at the fastPHASE 1.2 documentation page. Briefly, a single file is needed for this format. The input file is laid out as follows:

    no of individuals
    no of SNPsites
    P pos(1) pos(2) ... pos(no of SNPsites) (optional line)
    SSS...SSS (optional line)
    ID (1)
    genotypes(1-a)
    genotypes(1-b)
    ID (2)
    genotypes(2-a)
    genotypes(2-b)
    .
    .
    .
    ID (no.individuals)
    genotypes(no.individuals-a)
    genotypes(no.individuals-b)
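
The layout above can be walked through with a small reader such as the following sketch. It treats the two genotype lines of each individual as opaque strings, recognises the optional position line by a leading "P" field, and recognises the optional second line as one made up only of "S" characters; an individual ID that is exactly "P" or consists only of "S" would therefore be misread. The function name and the sample values are hypothetical.

```python
def read_fastphase(lines):
    """Return (number of SNP sites, positions or None, [(ID, genotypes a, genotypes b)])."""
    rows = [line.strip() for line in lines if line.strip()]
    n_individuals = int(rows[0])
    n_sites = int(rows[1])
    cursor = 2
    positions = None
    # Optional line: "P pos(1) pos(2) ..."
    if rows[cursor].split()[0] == "P":
        positions = rows[cursor].split()[1:]
        cursor += 1
    # Optional line made up only of the letter S
    if set(rows[cursor]) == {"S"}:
        cursor += 1
    individuals = []
    for _ in range(n_individuals):
        ind_id, genotypes_a, genotypes_b = rows[cursor:cursor + 3]
        individuals.append((ind_id, genotypes_a, genotypes_b))
        cursor += 3
    return n_sites, positions, individuals


example = ["2", "3", "P 100 200 300", "SSS",
           "ind1", "010", "011",
           "ind2", "111", "010"]
n_sites, positions, individuals = read_fastphase(example)
```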
            

HapBlock Format

Details on the HapBlock format can be found at the HapBlock Help page. The HapBlock format includes two files: a SNP map file and a genotype file.

  • Map File:
    • The first line contains the number of markers
    • The remaining lines contain the marker IDs, one marker ID per line

  • Genotype File from unrelated individuals:
    • The first line contains the number of individuals and the number of markers
    • The remaining lines start with individual IDs, followed by two alleles for each SNP
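
A genotype file for unrelated individuals with this layout could be read as in the sketch below. Whitespace-separated fields are an assumption, the allele coding is left as plain strings because it is not specified here, and the function name and sample values are hypothetical.

```python
def parse_hapblock_genotypes(lines):
    """Return {individual ID: [(allele 1, allele 2), ...]} with one pair per SNP."""
    n_individuals, n_markers = (int(value) for value in lines[0].split())
    genotypes = {}
    for line in lines[1:1 + n_individuals]:
        fields = line.split()
        alleles = fields[1:]
        if len(alleles) != 2 * n_markers:
            raise ValueError("expected two alleles per SNP")
        # Pair up consecutive allele columns: (0, 1), (2, 3), ...
        genotypes[fields[0]] = list(zip(alleles[0::2], alleles[1::2]))
    return genotypes


genotypes = parse_hapblock_genotypes(["2 2", "ind1 1 2 1 1", "ind2 2 2 1 2"])
```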

EIGENSOFT Format

EIGENSOFT is a software package containing C++ and Perl programs. You can find information about it at its web site. It supports multiple file formats. Documentation on the file formats can be downloaded together with the program. Mainly it needs three files to load data properly.

  • Map File (SNP File):
    • One line per SNP with four columns:
      • SNP Name
      • Chromosome
      • Genetic position in Morgans
      • Physical position in bases

  • Genotype Data File:
    • One line per SNP
    • In each line, one character is used to represent one individual:
      • 0 means zero copies of the reference allele
      • 1 means one copy of the reference allele
      • 2 means two copies of the reference allele
      • 9 means missing data

  • Individual Data File:
    • One line per individual with three columns:
      • Sample ID
      • Gender (M or F)
      • A label that might refer to case/control (affected status) or a population group label. If this value is set to be “Ignore”, then this individual and all genotype data from this individual will be removed from the data set in all convertf output.

  • Phenotype Data File:
    • Only one line with one character for each individual:
      • 0 means control
      • 1 means case
      • 9 means missing phenotype
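
Because the genotype data file uses one character per individual, a line can be decoded by walking it in step with the sample IDs from the individual data file. The sketch below shows this for a single SNP; the function name and sample IDs are hypothetical.

```python
def decode_eigensoft_genotype_line(line, sample_ids):
    """Map each sample ID to its reference-allele count (0, 1, 2) or None if missing."""
    codes = line.strip()
    if len(codes) != len(sample_ids):
        raise ValueError("one character per individual expected")
    counts = {}
    for sample_id, code in zip(sample_ids, codes):
        if code not in "0129":
            raise ValueError("unexpected genotype code: " + code)
        counts[sample_id] = None if code == "9" else int(code)
    return counts


counts = decode_eigensoft_genotype_line("0219", ["s1", "s2", "s3", "s4"])
```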

PrettyBase Format

Details on the Prettybase format can be found at the Seattle SNP Prettybase format page. The Prettybase format only requires one file containing individual IDs, locus IDs, and two alleles for each marker on each individual.


Generic 2D Matrix Format

The generic 2D Matrix format was developed by High Throughput Biology Inc. in response to frustration among researchers who do not know what format their data are in. It allows a researcher to load any data that have the following properties:

  • Data are organized as a two-dimensional matrix
  • The data file(s) may or may not have headers
  • All data points are tab or space delimited
  • Data can be in one file (genotype/phenotype file) or in two files (phenotype/genotype file and marker file)
    • Data In One File:
      • If headers are present, they should contain phenotype names and marker names for the corresponding columns in the data file.
      • If headers are not present, BioIDE gives the user the chance to provide them during data loading.
    • Data In Two Files:
      • If headers are present in the phenotype/genotype file, they should contain phenotype names for the corresponding columns in the data file.
      • If headers are present in the marker file, they should contain labels for the corresponding columns in the marker file. A marker ID column is required.
      • If headers are not present, BioIDE gives the user the chance to provide them during data loading.


