Are guarantees of genome anonymity realistic?



The HapMap reconsent form says “Because the database will be public, people who do identity testing, such as for paternity testing or law enforcement, may also use the samples, the database, and the HapMap, to do general research. However, it will be very hard for anyone to learn anything about you personally from any of this research because none of the samples, the database, or the HapMap will include your name or any other information that could identify you or your family.”

 [see , NIH Seeks Input on Proposed Repository for Genetic Information , Diagnostic Multivariate Index Assays ].


There are at least 10 scenarios in which an ‘anonymous’ human-subjects consent design can be compromised; precedents for each are given below.


(10) Re-identification after “de-identification” using other public data.

A redacted Group Insurance Commission list containing only birth date, gender, and zip code was sufficient to re-identify the full medical records of Governor Weld and his family via voter-registration records.

[See Sweeney 1998 ]
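
The linkage described above can be sketched as a join on the three quasi-identifiers. This is only a minimal illustration of the general idea, not Sweeney's actual procedure; every record, name, and field name below is invented for the example.

```python
from collections import defaultdict

# Hypothetical "de-identified" medical records: names removed,
# but birth date, gender, and zip code retained.
medical_records = [
    {"birth_date": "1950-01-15", "gender": "M", "zip": "02139", "diagnosis": "condition A"},
    {"birth_date": "1962-07-04", "gender": "F", "zip": "02139", "diagnosis": "condition B"},
    {"birth_date": "1962-07-04", "gender": "F", "zip": "02140", "diagnosis": "condition C"},
]

# Hypothetical public voter roll: names plus the same three fields.
voter_roll = [
    {"name": "Voter One", "birth_date": "1950-01-15", "gender": "M", "zip": "02139"},
    {"name": "Voter Two", "birth_date": "1962-07-04", "gender": "F", "zip": "02139"},
    {"name": "Voter Three", "birth_date": "1971-11-30", "gender": "F", "zip": "02139"},
]

QUASI_IDENTIFIERS = ("birth_date", "gender", "zip")


def quasi_key(record):
    """Return the tuple of quasi-identifier values for a record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(records, roll):
    """Pair each record with a voter name when exactly one voter shares its key."""
    voters_by_key = defaultdict(list)
    for voter in roll:
        voters_by_key[quasi_key(voter)].append(voter["name"])

    matches = []
    for record in records:
        names = voters_by_key.get(quasi_key(record), [])
        if len(names) == 1:  # a unique match is a re-identification
            matches.append((names[0], record["diagnosis"]))
    return matches


matches = reidentify(medical_records, voter_roll)
for name, diagnosis in matches:
    print(name, "->", diagnosis)
```

In this toy data two of the three "anonymous" records match exactly one voter each; the third has no voter with the same key and stays unlinked.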


(9) Hacking. 

“Drug Records, Confidential Data vulnerable via Harvard ID number & PharmaCare loophole”.  

“A hacker gained access to confidential medical information at the University of Washington Medical Center, using the Internet to download thousands of files containing patient names, conditions, home addresses and Social Security numbers.”

[See Harvard Crimson  2005  ]


(8) Combination of surnames from genotype with geographical info

In 2005, an anonymous sperm donor was traced over the internet by his 15-year-old son, who used his own Y-chromosome genealogy to find surname relations.

[See ]


(7) Inferring phenotype from genotype

Markers for eye, skin, and hair color, height, weight, racial features, dysmorphologies, etc. are known and the list is undergoing rapid growth and refinement. 

[See table below]


(6) Unexpected self-identification. 

An example of this at Celera undermined confidence in the investigators.

[See Kennedy D. Science. 2002 297:1237. Not wicked, perhaps, but tacky.]


(5) A tiny amount of DNA data in the public domain with a name leverages the rest.

A small named DNA profile would allow the much larger body of DNA data in the HapMap (or another study) to be linked to that person. Such a profile can become public, for example, in a court case, even if the suspect is acquitted. [See]
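
As a rough illustration of how a few named markers could leverage a larger anonymous dataset, the sketch below compares a short named profile against anonymous genotype records. The marker IDs, genotypes, sample labels, and function names are all invented for the example, and real matching would also have to handle genotyping error and close relatives.

```python
# Hypothetical anonymous study records: sample label -> genotype at each marker.
anonymous_db = {
    "sample_001": {"m1": "AG", "m2": "CC", "m3": "TT", "m4": "AG", "m5": "CT"},
    "sample_002": {"m1": "AA", "m2": "CC", "m3": "CT", "m4": "GG", "m5": "CT"},
    "sample_003": {"m1": "AG", "m2": "CT", "m3": "TT", "m4": "AA", "m5": "CC"},
}

# Hypothetical named profile from a public source, covering only three markers.
named_profile = {
    "name": "Person X",
    "genotypes": {"m1": "GA", "m3": "TT", "m4": "AG"},
}


def normalize(genotype):
    """Sort the two alleles so that 'GA' and 'AG' compare as equal."""
    return "".join(sorted(genotype))


def find_matches(profile, db):
    """Return sample labels whose genotypes agree with the profile at every shared marker."""
    hits = []
    for sample_id, genotypes in db.items():
        agrees = all(
            marker in genotypes
            and normalize(genotypes[marker]) == normalize(observed)
            for marker, observed in profile["genotypes"].items()
        )
        if agrees:
            hits.append(sample_id)
    return hits


hits = find_matches(named_profile, anonymous_db)
print(named_profile["name"], "matches", hits)
```

Once a single sample label is tied to a name this way, every other marker stored under that label (here m2 and m5) is tied to the same name.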


(4) Laptop theft.

Medical records of 26 million veterans, including SSNs and disabilities, were stolen in June 2006. Hewlett-Packard, Ford, Ameriprise, and Verizon have reported similar losses.

(3) Unauthorized access to DNA-bearing samples (e.g. hair, dandruff, hand-prints or lip-prints on glasses, etc.).

(2) Government subpoena. False positive IDs can be very disruptive to the affected family.


(1) Identification by phenotype. 

For example, if CT or MR imaging data is part of a genetic study, it may not look identifiable, but it is becoming increasingly easy to reconstruct a person's appearance from such data. Even blood chemistry can be identifying in some cases.

[See Modeling Age, Obesity, and Ethnicity in a Computerized 3-D Facial Reconstruction; King Tut's New Face: Behind the Forensic Reconstruction; Walker helps on cold case; and Forensics experts recreate face from bone fragments.]


There are no doubt other scenarios. Any one of these could have psychosocial, health, or economic impact on unprepared or unwilling human research subjects and their families. These scenarios could also cause significant loss of trust or a public-relations backlash, and a serious setback for NHGRI and the investigators involved. Even though scenario #5 may not sound like it would have a high impact, it did cause a significant amount of alarm in ELSI, IRB, corporate, and editorial circles. Variations on that theme could lead to identification of other members of an ‘anonymous’ pooled cohort. Discussing a plan for release of the identifying information in advance would have been preferable in that case (and is probably advisable in general).


For further discussion see the Personal Genome Project (PGP) editorial and web page.

  • Church GM (2005) The Personal Genome Project. Nature Molecular Systems Biology. doi:10.1038/msb4100040
  • Kohane IS, Altman RB. (2005) Health-information altruists--a potentially critical resource. N Engl J Med. 353:2074-7.
  • McGuire AL, Gibbs RA (2006). Genetics. No longer de-identified. Science. 312:370-1.


    Examples of identifying information found in human DNA & RNA:


    Trait                      Gene       Chromosome location    Notes
    Hair/iris color            ASIP       20 q11.2
    Hair/iris color            DCT        13 q32
    Green/blue iris            EYCL1      19 p13.1-q13.11
    Brown/blue iris            EYCL3      15 q11-q15 *
    Height                     GH1        17 q22-q24
    Height (Laron)             GHR         5 p13-p12
    Brown/blond hair           HCL1       19 p13.1-q13.11
    Brown/blond hair           HCL3       15 q11-q15 *
    Brown/red hair             HCL2        4 q28-q31
    Hair/iris color            HPS1       10 q23.1-23.3
    Hair/iris color            HPS2       10 q24.32
    Skin & hair color          MC1R       16 q24.3
    Height (Marfan)            MFS        15 q21.1
    Hair/iris color            MITF        3 p12.3-14.1
    Hair/iris color            MYO5A      15 q21
    Ocular albinism            OA1         X p22.3
    Ocular albinism            OA2         X p11.4-p11.23
    Oculocutaneous albinism    OCA2       15 q11.2-q12 *         R305W, R419Q: blue to brown and green, respectively
    Hair/iris color            POMC        2 p23.3
    Hair/iris color            RAB27A     15 q15-21.1
    Hair/iris color            SILV       12 q13-q14
    Skin color                 SLC24A5    15 q21.1               A111T: dark to light skin
    Short stature              SS          X&Y p
    Hair/iris color            TYR        11 q14-q21
    Hair/iris color            TYRP1       9 p23

    The Human Genome Project sequence is largely from one man from Buffalo, NY (code RP11).

    Out of ten volunteers in 1997, one male was “selected at random … Unfortunately, the attempt to prepare EBV-transformed cells for the RPCI-11 donor failed. As a consequence of the double-blind donor selection procedure, it was impossible to obtain a second sample from the same male donor for a second attempt to establish transformed cells.”   See (Osoegawa et al 2001).  Another donor identified himself in 2002.


    Gene        DNA (biallelic base in uppercase in the central codon)    One RP11 allele
    SLC24A5     atgttgcaggc Rca actttcatggcagcgg                          R=g (darker skin)
    OCA2_305    tccatcagcat cYg ggcctccctgcagcag                          Y=c (bluer eyes)
    OCA2_419    accggctctcc cRg ggacgggtgtgggcca                          R=g (bluer eyes)

    Note that the reference human genome represents only one of the two alleles in RP11 (above), but both alleles will be available for the HapMap individuals from Ibadan, Nigeria (YRI), Tokyo, Japan (JPT), Beijing, China (CHB), Utah, USA (CEU). 
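
The uppercase letters in the table above are standard IUPAC ambiguity codes (R = a or g, Y = c or t). As a small sketch of how such a site is read, the snippet below expands each central codon into its two possible sequences and picks the one matching the RP11 allele listed in the table. The dictionary and function names are illustrative only, and no amino-acid translation is attempted.

```python
# Standard IUPAC ambiguity codes used in the table above.
IUPAC = {"R": ("A", "G"), "Y": ("C", "T")}

# Central codons and RP11 alleles exactly as listed in the table.
central_codons = {
    "SLC24A5": "Rca",
    "OCA2_305": "cYg",
    "OCA2_419": "cRg",
}
rp11_alleles = {"SLC24A5": "g", "OCA2_305": "c", "OCA2_419": "g"}


def expand_codon(codon):
    """Return every sequence the codon can represent.

    The variable base is emitted in uppercase so it stays visible.
    """
    variants = [""]
    for base in codon:
        options = IUPAC.get(base, (base,))  # unambiguous bases pass through
        variants = [prefix + option for prefix in variants for option in options]
    return variants


def rp11_codon(site):
    """Return the variant whose variable base equals the listed RP11 allele."""
    wanted = rp11_alleles[site].upper()
    for variant in expand_codon(central_codons[site]):
        if wanted in variant:  # only the variable base is uppercase
            return variant
    return None


for site in central_codons:
    print(site, expand_codon(central_codons[site]), "RP11:", rp11_codon(site))
```

For a HapMap individual both alleles are available, so the same expansion would be reported as a pair of variants per site rather than a single one.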



    Family exposure:  anonymity vs advocacy

  • Examples of false security from Anonymity

  • Challenge of updating sperm-donor genetics to recipients
  • Challenge of updating sperm-donor offspring at risk of half-sibling marriage.
  • Over 600,000 AOL users "private" online searches remain public well after 'fix' attempted.