Y-Haplogroup Predictor Instructions


NEW (12/15/2012):  A long overdue upgrade to the Haplogroup Predictor program has been implemented, though for the immediate future, the old version is still available while we gain experience with the new one.  For the old version of the program, you can choose the 23-haplogroup version, or a 27-haplogroup version that includes some of the Asian haplogroups such as O2, O3, plus an interesting small haplogroup, E1a.  The haplogroup names have been updated to the 2012 ISOGG versions, or to the "short form" that includes a terminal SNP for the haplogroup.

The new version of the program has adopted the 111-marker set of Family Tree DNA as the standard.  Of the 86 markers used in the previous version of the program, only DYS508 is not included in the 111-marker set, and this marker seems not to be commercially available in any event.  The new version of the program has cut back somewhat on the number of haplogroups in the main program (to 21).  The intent is that a user first determines the major haplogroup with the main program and then, if a sub-clade program is available, uses a smaller set of markers to predict the sub-haplogroup.

Initially, only a sub-clade program for Haplogroup G2a is available, but others will be added as they are developed.  The sub-clade programs will use a subset of markers chosen for their ability to help distinguish among the subgroups.  The markers are ordered roughly according to their usefulness for this purpose, but weight is also given to the markers that are included in the smaller haplotypes such as the 12-marker, 25-marker, or 37-marker sets.  Only markers in the 67-marker set are presently used in the sub-clade programs because of a lack of 111-marker data by subgroup.  This may change in the future as more data become available.

NEW (06/15/08):  You can choose between the basic 21-haplogroup program, or a "beta" 23-haplogroup version.  The former "beta" 21-haplogroup version has proven itself and now is the basic version.  The 23-haplogroup version adds two haplogroups to the previous 21: C3 and G1.  The haplogroup names have been updated to correspond to ISOGG-2008.

NEW (02/04/07):  Leo Little has written a nice text entry feature, which Doug McDonald has adapted to the Haplogroup Predictor program.  You can now cut and paste a text string containing Y-STR values to a field at the bottom of the data entry page.  You might, for example, cut and paste from a web site such as your surname project, or from Y-Search or another database.  You can select the character that separates the values so that the string is properly interpreted.  If you get unexpected results, check the regular data entry boxes to make sure the right values are displayed.
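The idea behind the text entry feature can be sketched in a few lines of Python.  This is only an illustration, not the code used by the program: the function name parse_ystr_string, the choice to keep empty fields as None, and the sample values are all assumptions made for the example.

```python
def parse_ystr_string(text, separator=","):
    """Split a pasted string of Y-STR values into a list of integers.

    Hypothetical helper, not part of the Haplogroup Predictor itself.
    Empty fields are kept as None so that the remaining values still
    line up with the marker order of the data entry page.
    """
    values = []
    for field in text.split(separator):
        field = field.strip()
        values.append(int(field) if field else None)
    return values


# Invented sample values, comma-separated.
print(parse_ystr_string("13, 24, 14, 11"))

# The same values separated by tabs, with one marker left blank.
print(parse_ystr_string("13\t24\t\t11", separator="\t"))
```

If the wrong separator is selected, the whole string arrives as a single field and int() raises a ValueError in this sketch, which is the kind of problem that checking the regular data entry boxes would reveal.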

NEW (02/05/07):  You can choose between the basic 15-haplogroup program, or a "beta" 21-haplogroup version.  The 21-haplogroup version adds six haplogroups to the previous 15: I1b1b, I1b2*, J2a1b, J2a1k, J2b, and K2 (and the previous I1b1 and J2 are replaced by the new subgroups).

NEW (12/15/06):  You can now display the data entry cells in either FTDNA order or numeric order.  Just click the appropriate button from the main page to bring up the program with the order that you want to use.

NEW (12/15/06):  The Haplogroup Predictor now fits its name a little better with the addition of the calculation of the haplogroup probabilities using a Bayesian approach.  The left column of results contains the traditional "goodness-of-fit" scores and the right column contains the probabilities.  Note that each fitness score is unaffected by any of the other scores, but this is not true of the probabilities.  The probabilities must add to 100%, so one probability may increase only at the expense of the others.

The probabilities should be interpreted as follows:  Suppose the program has returned the following probabilities: 70% for I1a, 20% for G, 10% for R1b, and zero for the others.  This means that in a sample of 1000 men, all with the same haplotype (identical to the haplotype that was used in the analysis), if SNP tests were done on each, you would find on average that 700 of them would be I1a, 200 would be G, and 100 would be R1b.  Probabilities such as those used in this example would only be returned if it were possible that the same exact haplotype could exist in all three of those haplogroups (this could happen, for example, when the haplotype contained only a few markers, but probably not for a multi-marker haplotype).  On the other hand, if one haplogroup gets a probability of 100%, it means that this haplotype probably exists only in the indicated haplogroup.

The Bayesian approach requires that an analysis start with the frequency of each haplogroup in the geographic region where the haplotype originated.  These frequencies are called the "prior probabilities" or the "priors."  This set of frequencies will be different for northwest Europe than for eastern Europe.  At the moment, the program is set for northwest Europe priors by default, but priors for a few other regions may be used.  There is a selection box near the top of the data entry page where the priors may be changed.
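The role of the priors can be illustrated with a short Python sketch of Bayes' rule.  It is not the program's actual calculation: the function name, the two unnamed regions, and every number below are invented for the example, and the likelihood of the haplotype in each haplogroup is simply taken as given.

```python
def posterior_probabilities(priors, likelihoods):
    """Combine regional priors with haplotype likelihoods via Bayes' rule.

    priors:      haplogroup -> frequency of that haplogroup in the region
    likelihoods: haplogroup -> probability of seeing this haplotype
                 in a man who belongs to that haplogroup
    """
    unnormalized = {hg: priors[hg] * likelihoods[hg] for hg in priors}
    total = sum(unnormalized.values())
    # Dividing by the total forces the probabilities to add to 100%.
    return {hg: value / total for hg, value in unnormalized.items()}


# Invented numbers for illustration only.
likelihoods = {"I1a": 0.004, "G": 0.010, "R1b": 0.0004}
priors_region_a = {"I1a": 0.20, "G": 0.03, "R1b": 0.77}
priors_region_b = {"I1a": 0.05, "G": 0.15, "R1b": 0.80}

for name, priors in (("region A", priors_region_a), ("region B", priors_region_b)):
    result = posterior_probabilities(priors, likelihoods)
    print(name, {hg: f"{p:.0%}" for hg, p in result.items()})
```

With the same haplotype, the most probable haplogroup in this sketch is I1a under the first set of priors and G under the second, which is why the choice of region in the selection box matters.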

The new version eliminates the second round of haplogroup score calculations, because the Bayesian probabilities provide better information.

Allele frequency data are not available for a few markers in a few haplogroups.  You may enter any number of markers, and the fitness score will be computed from whatever allele frequency data are available; as a result, a fitness score may not reflect every marker value you have entered.  When a marker is not supported with data in all haplogroups, the marker is colored orange after you enter a value.  The Bayesian approach requires that every haplogroup in the analysis have the necessary allele frequency data, so when you enter a value for a marker that lacks data for some haplogroup, that haplogroup is removed from the analysis and a "-" appears in place of its probability.  To bring a dropped haplogroup back into the analysis, reset the orange markers to zero and do not enter values for them.
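The rule for missing data can be sketched in Python as follows.  This is a hypothetical illustration rather than the program's code: the function names are made up, and the table showing which markers have data in which haplogroup is invented.

```python
def orange_markers(entered_markers, allele_data):
    """Markers with an entered value that lack data in at least one haplogroup."""
    return [
        marker
        for marker in entered_markers
        if any(marker not in markers for markers in allele_data.values())
    ]


def split_haplogroups(entered_markers, allele_data):
    """Return (kept, dropped) haplogroups for the probability calculation.

    A haplogroup is dropped as soon as one entered marker has no
    allele frequency data for it; it would then be shown with "-".
    """
    kept, dropped = [], []
    for haplogroup, markers in allele_data.items():
        if all(marker in markers for marker in entered_markers):
            kept.append(haplogroup)
        else:
            dropped.append(haplogroup)
    return kept, dropped


# Invented coverage table: haplogroup -> markers that have frequency data.
allele_data = {
    "R1b": {"DYS393", "DYS390", "DYS19"},
    "G": {"DYS393", "DYS390"},
}
entered = ["DYS393", "DYS19"]  # markers with a nonzero value entered

print(orange_markers(entered, allele_data))
print(split_haplogroups(entered, allele_data))
```

Removing DYS19 from the entered list (the equivalent of resetting the orange marker to zero) puts G back among the kept haplogroups in this sketch.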

Nomenclature Differences

Unfortunately, not all of the testing labs are using the same conventions on a few of the markers.  All of the allele frequencies used in the Haplogroup Predictor are based on the conventions used by FTDNA and Y-Search.  If you were tested by DNA Heritage or Relative Genetics, for example, your value on TAGA-H4 will be one repeat unit greater than it would be when reported as GATA-H4 by FTDNA.  For the 37 markers tested by FTDNA, only the H4 marker needs adjustment from the results reported by Relative Genetics and DNA Heritage (further adjustments may be necessary if you tested with Oxford Ancestors).  The Y-Search web site has more information on these nomenclature differences:

http://www.ysearch.org/conversion_page.asp
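The H4 adjustment described above amounts to subtracting one repeat from the value reported by DNA Heritage or Relative Genetics.  A minimal Python sketch, in which the function name and the lab labels are assumptions made for the example:

```python
# Labs whose H4 value is one repeat greater than the FTDNA value,
# as described in the text above.
H4_OFFSET_LABS = {"DNA Heritage", "Relative Genetics"}


def h4_to_ftdna(value, lab):
    """Convert a reported H4 value to the FTDNA (GATA-H4) convention."""
    if lab == "FTDNA":
        return value
    if lab in H4_OFFSET_LABS:
        return value - 1
    # Other labs may need different adjustments; see the conversion page.
    raise ValueError(f"no H4 conversion known for lab: {lab}")


print(h4_to_ftdna(12, "DNA Heritage"))
```

The sketch deliberately refuses to guess for any other lab, since the adjustments for those labs are not given here.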

For markers other than the FTDNA-37, click the Conventions button from the main page.

Interpreting Results

The "fitness score" provides an estimate of how well your haplotype fits the values usually seen in the different haplogroups.  A typical haplotype that is actually in a given group will normally have a score in the range 50-100.  The closer you are to the modal values, the larger the score will be.  Note that when only 12 markers are used for prediction, even one or two variant marker values can result in a lower score.  Use all of the markers that you have to improve the prediction.

A score in the range 20-50 indicates a fair fit with that haplogroup.  If the highest score found is in this range, but the score is much greater than any other score, then you may well be in that haplogroup.  If you get two fairly close scores in the range 20-50, it is difficult to distinguish between these two possibilities.  However, this still gives you information on the most likely possibilities, and probably rules out those haplogroups with low scores.  Another possible reason for low scores is that your haplogroup may not be included in the analysis.
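As a rough aid to reading the results, the score ranges above can be written as a small Python sketch.  The function names and the sample scores are invented, and assigning a score of exactly 50 to the upper range is an assumption, since the two ranges share that boundary.

```python
def describe_fit(score):
    """Label a fitness score using the ranges given in the text."""
    if score >= 50:
        return "typical fit"
    if score >= 20:
        return "fair fit"
    return "low"


def rank_scores(scores):
    """Sort haplogroups from the highest fitness score to the lowest."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented scores: two close values in the fair range and one low score.
scores = {"R1b": 34, "I1a": 31, "G": 8}

for haplogroup, score in rank_scores(scores):
    print(haplogroup, score, describe_fit(score))
```

In this example the two leading scores are close together, so the result would narrow the possibilities to R1b and I1a without deciding between them.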

Technical Details

If you're interested in the technical details of how the fitness scores are calculated, read the article in the Spring 2005 issue of the Journal of Genetic Genealogy (link on the main page).

Similarly, the Bayesian approach is described in an article in the Fall 2006 issue of the Journal of Genetic Genealogy (link on the main page).

Feedback

Please provide feedback to wathey <at> hprg.com