(Version 1.1) The data files below describe the SNP data used for the
paper "A worldwide survey of haplotype variation and linkage
disequilibrium in the human genome" by DF Conrad, M Jakobsson, G Coop
et al. (Nature Genetics 38:1251-1260 [2006]).

*Version 1.1 of the package of files - created by Noah, Jan 19, 2007
   Files 1 and 4 added. 
*Version 1.0 of the package of files - created by Jon, Dec 11, 2006

---------------------------------------------------------------------

The following data sets are available:

1. unphased_HGDP_regions1to36 (HGDP data - 927 individuals, 2834 SNPs)
2. phased_HGDP_regions1to36   (HGDP data - 927 individuals, 2834 SNPs)
3. phased_HapMap_regions1to36 (Phase 2 HapMap genotypes for SNPs in 
	our data - 210 individuals, 2046 SNPs)
4. phased_combined_regions1to36 (HGDP+HapMap genotypes for SNPs in 
	our data - 1137 individuals, 2046 SNPs; this file is based on 
	the file previously called hgdp_hapmap_merged)

File 1 is the raw unphased data, after elimination of SNPs that failed
quality checks and individuals who were related.

File 2 is the phased data with all missing genotypes imputed,
using the genomic "regions" labels (1-36) as described in
Supplementary Table SM.2.  Almost all of the analyses in the paper
used this version of the data.

File 3 is the phased HapMap data with all missing genotypes imputed.
This file was created from the HapMap collection by leaving out the
offspring in CEU and YRI trios.

File 4 is the combined data from Files 2 and 3.

For some SNPs, Files 1 and 2 differ in strand polarity from File 3 -
that is, for some of the SNPs, our data arrived with a strand polarity
different from that in the HapMap.  To create File 4, the HapMap data
in File 3 was repolarized to match the HGDP data and was combined with
the subset of SNPs in File 2 present in the HapMap.  After the paper
was published, we discovered that in File 4 the allele labels appear
to be switched between HapMap and HGDP for one SNP in region 21:
rs2183577.  The attached File 4 contains the data (with this error)
exactly as we analyzed them in the paper.
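
As an illustration of the repolarization step, the sketch below
complements alleles at selected SNPs.  It assumes that "repolarized"
means taking the complementary base (A<->T, C<->G).  The README does
not list which SNPs were flipped, so the flip_mask argument is a
hypothetical input, and this is not the code used to build File 4.

```python
# Complement of each allele code; '?' (missing) is left unchanged.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C", "?": "?"}


def repolarize(haplotype, flip_mask):
    """Return a copy of `haplotype` with alleles complemented where flip_mask is True.

    haplotype: list of single-character allele codes (A, C, G, T or ?)
    flip_mask: list of booleans, one per SNP (hypothetical; not supplied
               with the data files)
    """
    if len(haplotype) != len(flip_mask):
        raise ValueError("haplotype and flip_mask must have the same length")
    return [COMPLEMENT[a] if flip else a for a, flip in zip(haplotype, flip_mask)]
```

Note that complementing cannot resolve A/T or C/G SNPs by inspection
of the allele codes alone, since both strands show the same pair of
letters.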

The data are in "structure" format with 2 rows per individual.  

Rows:
1. rs number
2. region number (1..36)
3. chromosome number
4. snp position on chromosome
5...1858: individual data for 927 individuals
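
The row layout above can be checked with a little arithmetic.  The
sketch below assumes that every file has exactly four header rows
followed by two rows per individual, with no blank or extra lines;
the function names are illustrative only.

```python
HEADER_ROWS = 4          # rs number, region, chromosome, position
ROWS_PER_INDIVIDUAL = 2  # two rows per individual in "structure" format


def total_rows(n_individuals):
    """Expected number of rows in a file with `n_individuals` individuals."""
    return HEADER_ROWS + ROWS_PER_INDIVIDUAL * n_individuals


def individual_row_range(index):
    """1-based row numbers of the two rows for the individual at 0-based `index`."""
    first = HEADER_ROWS + ROWS_PER_INDIVIDUAL * index + 1
    return first, first + 1
```

For the 927 HGDP individuals, total_rows(927) gives 1858, which
matches the last row number listed above.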

In the phased files (Files 2-4) each of the two rows for an individual
represents one of the two haplotypes.  Phasing was performed within
genomic regions, so there is no correspondence of haplotypes across
region boundaries.  In the unphased file (File 1), the placement of
genotypes on the first versus second line for an individual is
arbitrary.
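
Because phase is only meaningful within a region, one reasonable way
to work with a phased row is to cut it into per-region haplotypes.
The sketch below assumes the region row and the allele row have
already been read into equal-length lists; split_by_region is an
illustrative name, not part of any distributed tool.

```python
def split_by_region(regions, alleles):
    """Group one row of alleles into a haplotype string per region.

    regions: region number for each SNP (row 2 of the file)
    alleles: allele code for each SNP from one row of an individual
    Returns a dict mapping region number -> haplotype string.
    """
    if len(regions) != len(alleles):
        raise ValueError("regions and alleles must have the same length")
    grouped = {}
    for region, allele in zip(regions, alleles):
        grouped.setdefault(region, []).append(allele)
    return {region: "".join(codes) for region, codes in grouped.items()}
```

Haplotypes from the first row of an individual in one region should
not be joined with those from the first row in another region, since
the row order carries no information across region boundaries.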

Columns for individual data (HGDP individuals):
1. HGDP ID number
2. numeric code for population
3. name of population
4. country of origin
5. geographic region of origin
6. ID number assigned during genotyping
7. sex
8... genotypes (A, C, G, T, or ? for missing data or hemizygous males 
   on the X-chromosome)

Columns for individual data (HapMap individuals):
1. HapMap ID number (string)
2. numeric code for individual - this code is not unique and repeats 
	across HapMap populations; the number may also be shared with a 
	population code used for a HGDP population
3. name of population (YRI, CEU, or JPT+CHB)
4. name of population (YRI, CEU, or JPT+CHB)
5. name of population (YRI, CEU, or JPT+CHB)
6. meaningless - a placeholder to make the number of columns match the HGDP data
7. meaningless - a placeholder to make the number of columns match the HGDP data
8... genotypes (A, C, G, T, or ? for missing data or hemizygous males 
   on the X-chromosome)
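
A minimal sketch of reading the individual rows described above
follows.  It rests on assumptions that this README does not state:
fields are separated by whitespace, no field contains embedded
whitespace, and both rows of an individual carry the same ID in
column 1.  The names used are illustrative only.

```python
N_META = 7               # columns 1-7 hold IDs, population labels, etc.
VALID_CODES = set("ACGT?")


def parse_individual_row(line):
    """Split one individual row into (metadata fields, genotype codes)."""
    fields = line.split()  # assumes whitespace-delimited fields
    if len(fields) <= N_META:
        raise ValueError("row has no genotype columns")
    meta, genotypes = fields[:N_META], fields[N_META:]
    unexpected = set(genotypes) - VALID_CODES
    if unexpected:
        raise ValueError("unexpected genotype codes: %s" % sorted(unexpected))
    return meta, genotypes


def pair_rows(lines):
    """Pair consecutive rows into (metadata, first row, second row) per individual."""
    rows = [parse_individual_row(line) for line in lines]
    if len(rows) % 2 != 0:
        raise ValueError("expected two rows per individual")
    pairs = []
    for (meta_a, geno_a), (meta_b, geno_b) in zip(rows[0::2], rows[1::2]):
        # Assumption: the ID in column 1 is repeated on both rows.
        if meta_a[0] != meta_b[0]:
            raise ValueError("rows do not belong to the same individual")
        pairs.append((meta_a, geno_a, geno_b))
    return pairs
```

For HapMap individuals, columns 3-5 repeat the population name and
columns 6-7 are placeholders, so only the first three metadata fields
are informative.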