Last updated November 18, 2014
Identifying groups of descendants using pedigrees and genetically inferred relationships in a large database
Catherine A. Ball, Mathew J. Barber, Jake K. Byrnes, Peter Carbonetto, Kenneth G. Chahine, Donald B. Curtis, Ross E. Curtis, Julie M. Granka, Eunjung Han, Amir Kermany, Natalie M. Myres, Keith Noto, Yong Wang (in alphabetical order)
AncestryDNA™ offers several genetic analyses to help customers find, preserve, and share their family history.
The first is the inference of genetic ethnicity, or a statistical estimation of the historical origins of an individual's DNA. (See Ethnicity Estimate White Paper for more information.)
The second analysis delivered by AncestryDNA is known in the population genetic literature as identity-by-descent (IBD) analysis, or IBD matching. IBD analysis identifies pairs of customers who have long shared genetic segments suggestive of a recent common inheritance. Given the amount of DNA shared between two members, the analysis estimates how closely the two members might be related. By drawing connections between probable relatives, IBD analysis offers the opportunity for AncestryDNA members to expand their documented pedigrees. (To learn more about IBD analysis, see Matching White Paper.)
The purpose of this document is to explain how both IBD and pedigree information are used to identify groups of AncestryDNA members who are likely descendants of a single common ancestor. The set of these likely descendants is called a DNA Circle™, and is created by combining pedigree and IBD information from across the AncestryDNA member database. DNA Circles attempt to quantify the strength of each member's connection to a given ancestor as well as to other descendants of that ancestor.
In what follows, we first explain the motivation behind the development of DNA Circles. We then discuss the calculations forming the basis of DNA Circles and the algorithms behind DNA Circle creation. We conclude with a survey of more than 66,000 DNA Circles to address their relevance to AncestryDNA members.
2. Motivation for DNA Circles
A DNA Circle is a set of likely descendants of a given ancestor and is generated by combining pedigree and IBD information across the entire AncestryDNA database. A DNA Circle is always in reference to a particular ancestor. Each member of a DNA Circle has the ancestor in his or her online pedigree and shares DNA identical-by-descent with at least one other person in the circle. Here, we discuss the motivating goals of DNA Circles: (1) organizing a member's IBD matches, (2) providing supporting evidence for an individual's genealogical research, and (3) connecting distant genealogical relatives.
2.1. Organization of IBD Matches
The IBD analysis of AncestryDNA identifies segments of the genome that are shared IBD among pairs of AncestryDNA members, and uses the amount of IBD to estimate the degree of relationship between individuals. Given that this IBD analysis is performed across all pairs of AncestryDNA members, a member may have hundreds to thousands of identified distant relatives. While an individual may share DNA identical-by-descent with thousands of other living individuals, there are a limited number of ancestors through which the shared DNA could have been inherited. The number of an individual's ancestors (within a reasonable number of generations) is typically much smaller than the number of that same individual's living relatives.
Thus, in cases where the common ancestor is known, it is possible to group a number of IBD-identified relatives by the ancestor potentially responsible for the IBD. This is the first goal of DNA Circles. Such grouping of IBD matches allows a user to perform directed research on each ancestor through collaboration with each of their relatives in specific groups.
To illustrate this point, consider Figure 2.1, a graphical representation of the genetic inheritance of the genomes of four ancestors. Each genome is represented as a contiguous block of genomic sequence, or a haplotype. After recombination in the ancestors, blocks of the original ancestral genomes are inherited by the children in generation 2. Four to six generations later, those blocks of the original ancestral genomes are present in a set of three descendants.
If two individuals have inherited the same segment of DNA from a common ancestor, that DNA is said to be identical-by-descent, or IBD. In the example, individual D shares DNA IBD with individuals C and E. Assuming that pedigree information is available, one could identify common ancestors of any two individuals by tracing back through their pedigrees. The first common ancestor encountered is termed the most recent common ancestor, or MRCA. In tracing back through the pedigrees of C, D, and E, ancestor A would be identified as D's MRCA with C, as well as D's MRCA with E.
By grouping D with C and E into a DNA Circle, D would be able to distinguish these two individuals (C and E) from the other thousands of individuals with whom he shares DNA—and classify them as being related to him through ancestor A.
2.2. Supporting Evidence for Genealogical Research
Second, an AncestryDNA member's pedigree is often the result of detailed genealogical research involving the integration of information from various historical documents to support or identify ancestral lineages. In combining IBD and pedigree information across the entire AncestryDNA member database, DNA Circles can be considered another piece of supporting evidence for a member's pedigree—in addition to historical records.
This concept is also exemplified by Figure 2.1. The identification of ancestor A as the MRCA between individuals C and D lends evidence to support that C and D may have both inherited DNA from that ancestor. We then add the knowledge that individual E also has an IBD segment with individual D (in addition to sharing ancestor A as their MRCA). This lends an additional piece of evidence for ancestor A as part of individual D's pedigree, and consequently C's and E's.
Thus, the concordance of two independent pieces of information (pedigree relationships and patterns of IBD sharing) among a set of individuals (here, the DNA Circle including individuals C, D, and E; Figure 2.1) can serve as supporting evidence for documented pedigree lines. We note that as more descendants of ancestor A join the DNA Circle by sharing A as their MRCA as well as IBD with an existing member (here, C, D, or E), the amount of evidence supporting the ancestor as part of the descendants' pedigrees can continue to increase.
2.3. Connecting Distant Genealogical Relatives
One pertinent point relating to IBD is that even if two AncestryDNA members are both descended from a particular ancestor, they may not necessarily share any DNA inherited IBD (e.g., individuals C and E in Figure 2.1). Due to the randomness of genetic inheritance, one doesn't necessarily share DNA with all of one's distant cousins, particularly fourth cousins and beyond. While sharing DNA identical-by-descent is often evidence of the relatedness of two individuals, the lack of an IBD match does not necessarily imply a lack of a distant genealogical relationship.
Thus, the third and final goal of DNA Circles is to allow relatives who do not share identical-by-descent stretches of DNA to collaborate with one another. Figure 2.1 also demonstrates this goal. While individuals C, D, and E have all inherited DNA from ancestor A, not all pairs of descendants share DNA IBD (i.e., individuals C and E). However, as members of a relationship group around ancestor A, individuals C and E could examine their suggested genealogical relationship despite the fact that they are not an IBD match. DNA Circles thus open the possibility for AncestryDNA members to identify distant relatives with whom they do not share DNA IBD directly, but with whom they still have genetic evidence supporting their relationship.
3. Methods: Components of DNA Circles
First, we describe the methods which underlie the two main components of DNA Circles (see Figure 3.1):
- scored most recent common ancestors (MRCAs) identified between pedigrees of individuals who share DNA identical-by-descent and
- close family groups.
3.1. Scored Most Recent Common Ancestors (MRCAs)
For two individuals who share DNA identical-by-descent (see Section 2), we can estimate the number of generations separating the two individuals from their common ancestor. However, their IBD segments do not themselves reveal the identity of their common ancestor. Pedigree information for each individual enables a search for the two individuals' documented common ancestors. However, even if an MRCA is identified, that ancestor may not actually be the one from whom they inherited their shared DNA.
Thus, for every MRCA identified between two people who share DNA IBD, we calculate a score quantifying whether the two individuals share DNA IBD because they both inherited that DNA from the discovered MRCA.
The score is calculated using a combination of three different statistical weights, each scaled to lie between 0 and 1 (see Figure 3.2):
- W(IBD): the confidence that the IBD match is due to recent genealogical history,
- W(SharedAncestor): the confidence that the identified MRCA represents the same person in both online pedigrees, and
- W(Inheritance): the confidence that the individuals have an IBD match due to the shared ancestor in question (as opposed to from another ancestor or from more distant genealogical history).
Below, we describe the derivation and scoring of each of these weights and how they are used to calculate the score of an MRCA.
3.1.1. Do the individuals share DNA identical-by-descent?
When a DNA sample is submitted to AncestryDNA, genotypes are obtained for more than 700,000 genome-wide markers on the Illumina Omni-Express platform (see Ethnicity Estimate White Paper for more details). Given this genome-wide data, AncestryDNA performs an IBD analysis between all pairs of AncestryDNA members. This identifies pairs of individuals who have long identical segments of DNA likely inherited from a shared common ancestor in the recent past; i.e., inherited identical-by-descent (IBD) (see Figure 2.1). An IBD match is identified when a pair of individuals share a sufficient amount of DNA IBD to be considered related, and can be classified by the degrees of separation between the individuals: i.e., parent/child, siblings, first cousins, and so forth.
Though identified as IBD matches, each pair of individuals may have varying amounts of evidence that the shared DNA between them is actually identical by descent (rather than spurious matching or identical by state). To reflect this, we calculate a statistic W(IBD), a value between 0 and 1 which represents the confidence that the IBD match is due to relatedness through a recent common genealogical ancestor. As IBD matching algorithms at AncestryDNA are improved and modified, it is possible that the value of W(IBD) between any two members may change over time (see Section 6).
For a more detailed discussion of IBD analysis, estimation of the degree of separation, and the confidence scores, please see the Matching White Paper.
3.1.2. Do the individuals have a common ancestor?
As mentioned in Section 2, AncestryDNA members often build online pedigrees and associate these pedigrees with their DNA test (i.e., identify the individual in the online pedigree who submitted the DNA sample). These pedigrees are often the result of detailed genealogical research that integrates information from various historical documents or other Ancestry members' family trees.
Online pedigrees of AncestryDNA members have varying levels of accuracy and completeness. For example, some members may have limited information about their family history, and thus will have entire branches of their online pedigree missing or unknown. Conversely, others will have detailed pedigrees spanning multiple generations with thousands of identified relatives in collateral lines.
Once IBD is identified between two AncestryDNA members with associated pedigrees, the two pedigrees are compared to identify the set of most recent common ancestors (MRCAs) between them. Identifying MRCAs among pedigrees is not trivial given that each customer has their own unique pedigree, with potentially different documentation about their ancestors.
When looking for MRCAs between two pedigrees, we only search among the direct-line ancestors (i.e., parents, grandparents, great-grandparents, etc.) in each individual's pedigree (see Figure 3.3). This is because direct-line ancestors are the only relatives from whom an individual inherits DNA. Additionally, we only look for MRCAs in the most recent ten generations of the two pedigrees. This is for several reasons: (1) in more distant generations, the customer is much less likely to have inherited any DNA from a particular ancestor; (2) DNA inherited from a more distant ancestor may not be sufficient to indicate a shared lineage with other cousins; and (3) accuracy and dependability of customer pedigrees decreases as we go further into the past.
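The arithmetic behind the ten-generation limit can be sketched in a few lines. This is a minimal illustration, assuming a complete pedigree with no repeated ancestors and using the average autosomal contribution of one half per generation; the function names are hypothetical and are not part of the AncestryDNA pipeline.

```python
MAX_GENERATIONS = 10  # search depth stated in the text


def direct_line_ancestors(generations: int) -> int:
    """Largest possible number of direct-line ancestors within `generations`
    generations (2 parents + 4 grandparents + ...), assuming no ancestor
    appears twice in the pedigree."""
    return sum(2 ** g for g in range(1, generations + 1))


def expected_fraction_from_ancestor(generation: int) -> float:
    """Average share of the autosomal genome contributed by a single ancestor
    `generation` generations back. The realized share varies around this
    average and can be zero for distant ancestors."""
    return 0.5 ** generation


if __name__ == "__main__":
    print("direct-line ancestors in 10 generations:",
          direct_line_ancestors(MAX_GENERATIONS))
    for g in (1, 2, 6, 10):
        print(f"generation {g}: {expected_fraction_from_ancestor(g):.4%}")
```

The average contribution halves with every generation, which is consistent with reason (1) above: by the tenth generation a single ancestor contributes on average about a tenth of one percent of the genome.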
For all pairs of direct-line ancestors in the pedigrees of each individual, Ancestry uses a proprietary algorithm to identify individuals in two pedigrees who are likely the same person. Information from the online pedigrees used in this comparison includes vital information and relationships. Given the concordance in information between two nodes of a pedigree, the algorithm gives a score quantifying whether the ancestor in the two pedigrees is the same person. Here, we represent this score, scaled between 0 and 1, as W(SharedAncestor). If the information about an ancestor in two pedigrees matches perfectly, the comparison will likely receive a perfect W(SharedAncestor) score of 1.
When comparing two pedigrees, we only consider the most recent common ancestors (MRCAs), not all common ancestors. To understand why this is important, consider a mother-daughter relationship. The entire set of common ancestors of the mother and daughter includes the mother and all of the mother's direct-line ancestors. Since the mother's ancestors are redundant information, only the mother would be denoted as the MRCA.
However, it is possible to have multiple MRCAs on different lines in two pedigrees. In this case, we consider all MRCAs that are identified on each line of the members' pedigrees. While the most common case of this is when two individuals are descendants of a single couple, there are also many cases where different common ancestors may be found on two different branches of a pedigree. The example in Figure 3.4 illustrates two individuals who share four individuals in their pedigrees who are MRCAs: A, B, C, and D.
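The distinction between all common ancestors and MRCAs can be illustrated with a small sketch. It is not the proprietary matching algorithm: it assumes that ancestors have already been resolved to shared identifiers across both pedigrees (the step scored by W(SharedAncestor)), and it adopts the working definition that a common ancestor is an MRCA if it can be reached from either individual without first passing through another common ancestor. The `pedigree` mapping and function names are hypothetical.

```python
from collections import deque


def direct_line(pedigree, person, max_generations=10):
    """Return {ancestor: generations back} for `person`, including the
    person at depth 0. `pedigree` maps each person to a list of parents."""
    depth = {person: 0}
    queue = deque([person])
    while queue:
        node = queue.popleft()
        if depth[node] == max_generations:
            continue  # do not search beyond the generation limit
        for parent in pedigree.get(node, ()):
            if parent not in depth:
                depth[parent] = depth[node] + 1
                queue.append(parent)
    return depth


def find_mrcas(pedigree, person_i, person_j, max_generations=10):
    """Return the common ancestors that are most recent on some line."""
    common = (set(direct_line(pedigree, person_i, max_generations))
              & set(direct_line(pedigree, person_j, max_generations)))
    mrcas = set()
    for start in (person_i, person_j):
        seen = {start}
        queue = deque([(start, 0)])
        while queue:
            node, generations = queue.popleft()
            if node in common:
                mrcas.add(node)
                continue  # ancestors above an MRCA are redundant
            if generations == max_generations:
                continue
            for parent in pedigree.get(node, ()):
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, generations + 1))
    return mrcas


if __name__ == "__main__":
    tree = {"daughter": ["mother", "father"], "mother": ["gm", "gf"]}
    print(find_mrcas(tree, "daughter", "mother"))
```

In the mother-daughter case the search stops at the mother and never reports her parents, matching the reasoning above. For two first cousins it would report both shared grandparents, the single-couple case described in the text.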
Given that MRCAs are identified using online pedigrees, for which members may modify or update their documentation, the search for common ancestors between pairs of pedigrees is repeated periodically. For each pair of individuals identified as an IBD match, the results of this search for MRCAs, along with W(SharedAncestor), are recorded and stored in preparation for DNA Circle analysis.
3.1.3. Do the individuals share DNA identical-by-descent from the MRCA?
The third component of the MRCA score is W(Inheritance). In the context of W(Inheritance), we treat each MRCA couple (a pair of mating MRCAs) as an entity, rather than examining individual MRCAs (as done for W(SharedAncestor)). This is because both members of a couple pass down DNA to their descendants, and it is difficult to tease apart which individual has contributed the DNA shared IBD between their descendants. W(Inheritance) scores whether two individuals inherited their IBD segments from the MRCA couple in question.
To understand the motivation behind W(Inheritance), consider two individuals who share DNA identical-by-descent through an MRCA couple other than the one identified in the two pedigrees. One reason this might occur is through mating patterns. Throughout human population history, it is common to find instances of one family marrying into another more than once across several generations, resulting in member pedigrees potentially having MRCAs on two or more different family lines.
In addition, large parts of customer pedigrees are often unobserved. For a variety of reasons including the dreaded "brick walls" of genealogical research, a member's ancestors along particular lineages may be unknown. (A majority of member pedigrees have 100–200 ancestors in their direct lines; see Section 5 and Figure 3.5). In this case, even if an MRCA is identified between two AncestryDNA members, their IBD could be a result of DNA inherited from a couple who is unobserved in one or both pedigrees (see Figure 3.6).
To calculate W(Inheritance), which is also scaled to lie between 0 and 1, we use a proprietary algorithm that takes into account (1) direct-line pedigree size, (2) number of shared ancestral couples, and (3) the generational depth of the shared MRCA couple. Each of these factors contributes to our confidence about whether the two individuals inherited their IBD segments from the MRCA couple in question.
3.1.4. Score of a Most Recent Common Ancestor
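The final score (ei↔j) for each identified MRCA between two individuals i and j who share IBD is a product of the three components described above:
$$ e_{i \leftrightarrow j} = W(\mathrm{IBD}) \times W(\mathrm{SharedAncestor}) \times W(\mathrm{Inheritance}) \qquad \text{(Eqn. 3.1)} $$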
To construct DNA Circles, we take a conservative approach and only include pairs of individuals where there is strong evidence that their IBD segments were inherited from the given common ancestor (i.e., ei↔j ≥ t, where t is some predefined threshold, and ei↔j is calculated using Eqn. 3.1). While using this threshold possibly removes pairs of individuals from DNA Circles (see Section 4), it also ensures a more certain foundation for DNA Circle construction.
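As a minimal sketch of the scoring and filtering step, the MRCA score can be computed as the product of the three weights and then compared with the threshold t. The three weights come from proprietary algorithms, so they are plain inputs here. The function names and the example threshold are hypothetical, since the text does not state the value of t.

```python
def mrca_score(w_ibd: float, w_shared_ancestor: float, w_inheritance: float) -> float:
    """Product of the three weights, each expected to lie in [0, 1]."""
    for name, weight in (("W(IBD)", w_ibd),
                         ("W(SharedAncestor)", w_shared_ancestor),
                         ("W(Inheritance)", w_inheritance)):
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1, got {weight}")
    return w_ibd * w_shared_ancestor * w_inheritance


def passes_threshold(score: float, t: float) -> bool:
    """Keep a pair for DNA Circle construction only if its score is >= t."""
    return score >= t


if __name__ == "__main__":
    T_EXAMPLE = 0.5  # hypothetical value; the real threshold is not published
    score = mrca_score(0.9, 1.0, 0.8)
    print(score, passes_threshold(score, T_EXAMPLE))
```

Because the score is a product of values no larger than 1, it can never exceed its weakest component, which fits the conservative approach described above.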
3.2. Family Group Identification and Linking
The second building block of DNA Circles is the set of DNA Circle members themselves. A member of a DNA Circle can be either (1) an individual AncestryDNA member or (2) a "family group," which groups individual AncestryDNA members inferred to be part of a family in order to combine their non-independent genetic information. Family groups are an aggregation of individuals who have pedigree evidence and inferred IBD suggesting that they are first cousins once removed or closer. Since individual members of DNA Circles require no further discussion, this section is devoted entirely to family groups.
3.2.1. Family Groups: Intuition
There are two reasons for creating family groups. First, members of a close family share a large amount of DNA identical-by-descent. For example, a child inherits 50% of their genome from each of their parents and approximately 25% of their genome from each of their grandparents. First cousins thus share approximately 12.5% of their genomes IBD from their shared grandparents. Because of this large amount of IBD, close relatives give redundant information regarding distant ancestors. Second, members of a family generally know one another and are confident in their pedigree relationships. As we later explain (see Section 4), collapsing closely related individuals into family groups, and allowing them to connect with other members of a DNA Circle as a single entity, enables any given family group member to connect into more DNA Circles than they would alone.
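The percentages quoted above follow from the standard expectation that each shared ancestor contributes a factor of one half per generation on each line of descent. A short sketch, with a hypothetical function name, reproduces them. These are averages; the realized amount of sharing varies between pairs.

```python
def expected_shared_fraction(gens_i: int, gens_j: int, n_common_ancestors: int) -> float:
    """Average fraction of the genome two relatives share IBD.

    gens_i, gens_j: generations from each person up to the common ancestor(s)
    n_common_ancestors: 1 for a single shared ancestor, 2 for a shared couple
    """
    return n_common_ancestors * 0.5 ** (gens_i + gens_j)


if __name__ == "__main__":
    print("parent/child:", expected_shared_fraction(0, 1, 1))
    print("grandparent/grandchild:", expected_shared_fraction(0, 2, 1))
    print("first cousins:", expected_shared_fraction(2, 2, 2))
    print("first cousins once removed:", expected_shared_fraction(2, 3, 2))
```

First cousins once removed, the most distant relationship admitted to a family group, share about half as much on average as first cousins.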
3.2.2. Creating Family Groups
A family group is in reference to a particular most recent common ancestor. We calculate family groups based on IBD matches with identified MRCAs, where IBD suggests relationships estimated to be at a first cousin once removed level or closer. This threshold was selected for two reasons: (1) we wanted members of close family groups to be related within 2–3 generations to maintain high tree accuracy (most members know who their grandparents are), and (2) analyses suggest that given the large amount of IBD between them, we identify nearly all true first cousin relationships at this level (recall > 99%). While we miss some first cousin once removed relationships at this level, recall is still high (>82%).
We discuss the creation of family groups with an example. Figure 3.7 shows a pedigree with 5 AncestryDNA members A, B, C, D, and E, as well as ancestor F (not an AncestryDNA member). Table 3.1 shows an example set of results from an AncestryDNA IBD analysis among only individuals A, B, C, and D.
To create a family group, we consider each pairwise relationship where there is both IBD to suggest a first cousin once removed relationship or closer, as well as an identified MRCA (see Table 3.1). In practice, the order in which pairs of individuals are examined and added to family groups is random.
| Member 1 | Member 2 | Relationship Level Estimated from IBD | Identified MRCA |
The following steps would be followed to analyze these identified IBD matches and create a family group. First, B and C would form a family group with A as the MRCA, since they are estimated to be related closer than first cousins once removed (see Figure 3.8). Since B shares IBD with D and has the same MRCA (A), D is then added to the family group. A's IBD match with B adds A to the family group for the same reason. A, B, C, and D will thus all belong to the same family group. Note that the aggregation of these individuals into a family group would only be performed for DNA Circles for ancestors along A's direct line.
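The grouping steps above amount to merging individuals who are linked by close IBD matches that share the same MRCA. One way to sketch this is with a union-find structure, which gives the same groups regardless of the order in which pairs are examined. The relationship labels, the tuple layout, and the example relationship levels are assumptions made for illustration; they are not the values of Table 3.1.

```python
from collections import defaultdict

# Hypothetical labels for "first cousin once removed or closer".
CLOSE_LEVELS = {
    "parent/child", "siblings", "grandparent", "aunt/uncle",
    "first cousins", "first cousins once removed",
}


def build_family_groups(matches):
    """matches: iterable of (member_1, member_2, ibd_level, mrca).
    Returns {mrca: [frozenset of members, ...]}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for m1, m2, level, mrca in matches:
        # Require both a close IBD estimate and an identified MRCA.
        if level in CLOSE_LEVELS and mrca is not None:
            root_1, root_2 = find((mrca, m1)), find((mrca, m2))
            if root_1 != root_2:
                parent[root_2] = root_1

    by_root = defaultdict(set)
    for mrca, member in list(parent):
        by_root[find((mrca, member))].add(member)

    groups = defaultdict(list)
    for (mrca, _), members in by_root.items():
        groups[mrca].append(frozenset(members))
    return dict(groups)


if __name__ == "__main__":
    example = [
        ("B", "C", "first cousins", "A"),   # illustrative levels only
        ("B", "D", "first cousins", "A"),
        ("A", "B", "grandparent", "A"),
        ("B", "X", "fourth cousins", "Z"),  # too distant: ignored
    ]
    print(build_family_groups(example))
```

Keying each entry by the pair (MRCA, member) reflects the statement that a family group is always in reference to a particular most recent common ancestor.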
3.2.3. Family-to-Family Connections
Sometimes, we identify relationships between family groups that add the members of one family to another—such as when the MRCA of one family is found to be a descendant of another. For the sake of explanation, consider Figure 3.7, where A has a sister E. A and E would create a family group around their MRCA (their mother, F). Though B, C, and D would all share IBD with E, it is unlikely that they would share enough IBD with E to be in a family group (as they are not related closer than first cousins once removed). However, since B, C, and D all belong to a family group with A—and A belongs to a family group with E—A's family group would become a "sub-family" to F's (the mother's) family.
3.2.4. Limitations of Family Group Creation
There are a number of limitations to the creation of family groups. As with the identification of MRCAs (see Section 3.1), family groups can only be created when AncestryDNA members have associated their sample with a node in their pedigree. Family relationships, and thus DNA Circles, will be compromised if one associates their sample with an inaccurate node in the tree representing someone other than the person who took the AncestryDNA test. As with all analyses relating to DNA Circles, tree quality is also an important caveat and limitation (see Section 6).
From initial analyses, it appears that members with well-vetted pedigrees can benefit as members of family groups. AncestryDNA members who belong to family groups are often members of more DNA Circles than members who do not belong to family groups (see Section 5).
4. Methods: DNA Circle Construction
Given the building blocks of DNA Circles described in Section 3, we describe the steps taken to create DNA Circles and calculate relevant scores.
First, we introduce the concept of a weighted graph, or network, which is a useful means of visualizing and representing a DNA Circle (see Figure 4.1). Members (both individuals and family groups; see Section 3.2) are represented by nodes, and the weighted edges between the nodes are the scores of the MRCAs (ei↔j; see Section 3.1). Using this framework, we discuss how DNA Circles are created and how their associated scores are calculated (see Figure 4.2).
4.1. Adding Nodes and Edges
We reiterate that a DNA Circle is always in reference to a particular ancestor. We aggregate pairs of individuals who not only share IBD but also have an MRCA at the same ancestor, where there is sufficient evidence to support that the pairs of individuals in the group actually did inherit DNA from the MRCA (see Section 3.1).
The construction of DNA Circles is performed in a similar manner for family groups. Given a list of pairs of nodes (family groups and individuals) who share both IBD and an MRCA with an MRCA score ei↔j ≥ t (see Section 3.1), we iterate through each pair in arbitrary order. We persist records of which members are connected to whom (and at what strength) in each DNA Circle. By definition, a DNA Circle must have three or more nodes. Note that we only create DNA Circles around ancestors who are less than or equal to 6 generations back from all nodes (i.e., we omit pairs of nodes if their MRCA is more than 6 generations back).
Iterating through each pair of potential members creates new networks in addition to adding nodes and edges to existing ones (see Figure 4.3). When an MRCA is found that is already the head of a DNA Circle, we add the members. Given the random order in which DNA Circles are constructed, multiple DNA Circles may be merged in this process (see Figure 4.3). A final DNA Circle is a collection of nodes and weighted edges between those nodes.
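The construction described above can be sketched as a grouping of qualifying pairs by ancestor, followed by a search for connected sets of nodes. This sketch assumes that a DNA Circle corresponds to a connected component of the per-ancestor graph, which reproduces the merging behavior regardless of the order in which pairs are processed. The tuple layout, function name, and example scores are hypothetical.

```python
from collections import defaultdict

MAX_GENERATIONS_BACK = 6  # ancestor must be within 6 generations of all nodes
MIN_CIRCLE_SIZE = 3       # a DNA Circle must have three or more nodes


def build_circles(scored_pairs, t):
    """scored_pairs: iterable of (node_i, node_j, ancestor, score, gens_i, gens_j).
    Returns {ancestor: [(frozenset of nodes, {frozenset(pair): score}), ...]}."""
    # ancestor -> node -> {neighbor: score}
    adjacency = defaultdict(lambda: defaultdict(dict))
    for node_i, node_j, ancestor, score, gens_i, gens_j in scored_pairs:
        if score < t or max(gens_i, gens_j) > MAX_GENERATIONS_BACK:
            continue
        adjacency[ancestor][node_i][node_j] = score
        adjacency[ancestor][node_j][node_i] = score

    circles = defaultdict(list)
    for ancestor, graph in adjacency.items():
        unvisited = set(graph)
        while unvisited:
            stack = [unvisited.pop()]
            component = set(stack)
            while stack:  # depth-first walk over one connected component
                node = stack.pop()
                for neighbor in graph[node]:
                    if neighbor not in component:
                        component.add(neighbor)
                        stack.append(neighbor)
            unvisited -= component
            if len(component) >= MIN_CIRCLE_SIZE:
                edges = {frozenset((a, b)): s
                         for a in component for b, s in graph[a].items()}
                circles[ancestor].append((frozenset(component), edges))
    return dict(circles)


if __name__ == "__main__":
    pairs = [
        ("C", "D", "A", 0.9, 5, 5),
        ("D", "E", "A", 0.8, 5, 6),
        ("C", "E", "A", 0.2, 5, 6),   # below threshold: no edge at this stage
        ("X", "Y", "A", 0.95, 4, 4),  # only two nodes: not a circle
        ("P", "Q", "Z", 0.9, 7, 5),   # ancestor too far back: skipped
    ]
    print(build_circles(pairs, t=0.5))  # t = 0.5 is a made-up threshold
```

With the illustrative input, C, D, and E form one circle around ancestor A even though C and E have no qualifying edge between them, which mirrors the example of Figure 2.1.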
Since the AncestryDNA member database is not static (members may update their online pedigrees and new members may join at any time), DNA Circles are re-constructed in this manner on a regular basis.
4.1.1. Family Group Nodes
We note that family groups (see Section 3.2) are treated as a single node of the network (i.e., a single "member" of the DNA Circle)—even though they are made up of several individual AncestryDNA members. In Figure 4.1, family groups are shown as an icon with multiple individuals within a node. A family group will have an edge to another node of the network (family or individual) when at least one of the members of the family group has ei↔j ≥ t with that other node.
When multiple members of a family group are connected to the same other node (individual or family group) in the DNA Circle, the highest MRCA score is used to score the edge connecting that other node to the family group node. In other words, family group node k's connection to node j is equal to:
$$ e_{k \leftrightarrow j} = \max_{i \in k} e_{i \leftrightarrow j} \qquad \text{(Eqn. 4.1)} $$
Take again the family group example in Figure 3.7 (see Section 3.2). While A, B, C, and D might all share IBD with other members of the DNA Circle, all four people will be treated as one node of the network.
A consequence of family groups is that although one individual of a family group might not share IBD with any other members of the DNA Circle, they can still be a member of the DNA Circle. In the example of Figure 4.4, even though D does not share IBD with anyone in the DNA Circle, since he is part of the family group around A, he is a member.
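The edge rule for family group nodes can be written as a short function: the family group takes the highest MRCA score among its members, and the edge exists only if that score reaches the threshold t. This is a sketch of the rule as described, with a hypothetical function name and made-up scores.

```python
from typing import Mapping, Optional


def family_group_edge(member_scores: Mapping[str, float], t: float) -> Optional[float]:
    """member_scores maps each family group member to their MRCA score with
    another node j (members with no match to j may be omitted).
    Returns the edge score for the family group node, or None if no
    member's score reaches the threshold t."""
    best = max(member_scores.values(), default=0.0)
    return best if best >= t else None


if __name__ == "__main__":
    # D has no IBD match with node j, yet the family group still gets an edge.
    print(family_group_edge({"A": 0.30, "B": 0.85, "D": 0.0}, t=0.5))
```

A member such as D, whose own score is zero, still belongs to the circle through the family group node, as in the Figure 4.4 example.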
In general, combining closely related individuals into family groups has the consequence of increasing the number of DNA Circles to which family group members belong (as in the example above; see also Section 5). By definition, a family group will have the same number of IBD matches or more than any one individual within it. As another example (see Figure 3.7), even just grouping B and D into a family group allows them to collectively connect with more IBD matches on A's line (since B and D randomly inherit a potentially different 25% of A's DNA).
There may be occasions when different family group members have different distant ancestors at a given position in their pedigrees. When there are disputes between pedigrees of family group members, the construction algorithm greedily selects the first MRCA it observes and ignores any connections that disagree. To continue with our example, C's pedigree may indicate that A's mother is G instead of F. If C's connection into G's DNA Circle happens to be observed first, the whole family group will join a DNA Circle around G, and not the DNA Circle around F. If a member makes modifications to their pedigree (i.e., if C replaces G with F), the family could then join a different DNA Circle (in this case, a DNA Circle around F).
4.2. Scoring Members of a DNA Circle
Once a DNA Circle is created, we score each member's membership to the circle. This membership score is aimed at quantifying the amount of evidence that an individual is related to all of the members of the DNA Circle through the common ancestor of the circle—and thus is likely to be a descendant of that ancestor.
4.2.1. Score of the DNA Circle
To calculate the membership score for each node, the first step is to score the entire DNA Circle.
We base the score of the entire DNA Circle on (1) the size of the DNA Circle and (2) the strength and number of connections that make up the DNA Circle.
First, the size of the DNA Circle is an important factor. Intuitively, large circles contain more collective evidence than small circles. While we impose rigorous thresholds on adding nodes to the circle (by requiring large ei↔j), confounding connections could still potentially exist as a result of errors in member pedigree data or spurious IBD identification. For example, circles of size three have at most three pairwise connections between individuals, an observation that could be due to chance. Adding more members to the circle is effectively akin to increasing sample size, lending more evidence that the existing connections may be due to IBD from the same common ancestor. Thus, the score of a DNA Circle is designed to give larger circles a higher score.
Second, we consider the "connectivity" of the network: the number and strength of the connections between nodes (see Figure 4.5). Similarly, more connections between individuals are also akin to an increase in sample size. The more connections between nodes in a DNA Circle (and the stronger those connections), the less likely it is that any one of those connections is due solely to chance.
We note that while only edges ei↔j ≥ t involving the MRCA of the DNA Circle are initially used to create the network, to score the DNA Circle, all edges ei↔j > 0.0 are used. (In cases where the MRCA of i and j is a descendant of the MRCA of the DNA Circle, ei↔j is scaled by the number of generations separating the two MRCAs).
The score for each DNA Circle (groupscore) is calculated using the equation:
In Eqn. 4.2, n is the number of members (nodes) of the DNA Circle, and ei↔j are the MRCA scores between individuals i and j in the DNA Circle (as described above; see also Section 3.1). In the denominator, f(n) is a weight that is a function of n. While the total number of edges possible in a group of size n is n(n-1)/2, we calculate f(n) such that small groups need nearly complete connectivity to score well, while larger groups are granted more leeway. In the numerator, the weight w is also a function of n, and penalizes DNA Circles with only 3 members (i.e., n = 3).
The groupscore behaves such that as more nodes are added to the DNA Circle, the score of the group increases. As we explain in Section 4.2.2, this groupscore contributes to the membership score of each member of the DNA Circle, allowing any one member's score to increase as more and stronger connections are added to the DNA Circle as a whole.
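The notion of connectivity used in this section can be made concrete with a simple measure. The sketch below is not the groupscore: the weights w and f(n) are not specified in the text, so it uses the plain count of possible edges, n(n-1)/2, as the denominator. The function names and example values are hypothetical.

```python
def max_possible_edges(n: int) -> int:
    """Number of distinct pairs among n nodes."""
    return n * (n - 1) // 2


def connectivity(n: int, edge_scores) -> float:
    """Sum of edge scores divided by the number of possible edges.
    Equals 1.0 only when every pair is connected with a score of 1."""
    if n < 2:
        raise ValueError("connectivity needs at least two nodes")
    return sum(edge_scores) / max_possible_edges(n)


if __name__ == "__main__":
    print(max_possible_edges(3))               # a circle of three has 3 pairs
    print(connectivity(3, [1.0, 1.0, 1.0]))    # fully connected
    print(connectivity(4, [0.9, 0.8, 0.7]))    # 3 of 6 possible edges present
```

Under this plain measure a large circle would need a very large number of edges to score well, which is presumably why the text describes f(n) as granting larger groups more leeway than n(n-1)/2 would.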
4.2.2. Membership Score Calculation
Given a score for the entire DNA Circle, we then calculate a membership score for each node of the DNA Circle. This incorporates the groupscore itself (Eqn. 4.2), as well as the strength and number of connections from the given node to each other node of the DNA Circle.
We calibrated this membership score by examining a set of DNA Circles of various sizes. Using this data, we define a linear function g(n) that describes the expected sum of all edges (ei↔j) from an individual node i to all other nodes j of a DNA Circle of size n.
Given this function g(n), we define the membership score for individual i as:
where again groupscore is the DNA Circle score calculated in Eqn. 4.2, and ei↔j is the MRCA score between individual i and all other individuals j in the DNA Circle (see Section 3.1 and Eqn. 3.1). While somewhat ad hoc, the membership score has the property that nodes in DNA Circles with high groupscores have higher scores, as do individuals with larger summed connections to other members of the circle.
To calculate membership scores for members of family groups, the family group's membership score is used. In other words, Eqn. 4.3 is used with the maximum value of ei↔j for all connections to the family group node (see Eqn. 4.1 and Section 4.1.1).
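The family group rule above can be sketched in a few lines. This is a minimal illustration, assuming that the family group node keeps, for each other node of the DNA Circle, the largest MRCA score held by any of its members. Eqn. 4.1 is not reproduced in this section, so the function name, the data layout, and the scores below are hypothetical.

```python
from typing import Dict


def family_group_edges(member_edges: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Collapse per-member edges into one edge per outside node.

    member_edges maps each family group member to that member's MRCA
    scores (edges) with nodes outside the family group. The family group
    node keeps the maximum score observed for each outside node.
    """
    combined: Dict[str, float] = {}
    for edges in member_edges.values():
        for other_node, score in edges.items():
            if score > combined.get(other_node, float("-inf")):
                combined[other_node] = score
    return combined


# Illustrative (made-up) edge scores for a two-person family group.
edges = family_group_edges({
    "parent": {"cousin_a": 0.8, "cousin_b": 0.3},
    "child": {"cousin_a": 0.5, "cousin_c": 0.2},
})
```

In this toy case the family group node is connected to three outside nodes, although neither member alone is connected to more than two, and its edge to cousin_a takes the parent's stronger score.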
Figure 4.6 shows a distribution of membership scores (scorei) calculated from more than 66,000 DNA Circles from more than 63,000 AncestryDNA customers who consented to participate in research (see Section 5). The heavy skew of scores at lower values emphasizes the conservative nature of the score as designed: high scores are rare and only granted given a large amount of supporting evidence.
4.2.3. Membership Score Interpretation
The membership score is shown to each member of the DNA Circle to quantify the relative amount of evidence supporting that they are related to other members of the circle via the shared ancestor. Rather than reporting numerical values, the score is shown in qualitative categories: "weak," "some," "emerging," "good," and "strong" evidence. These categories are determined using an empirical strategy based on how a membership score compares to other membership scores from DNA Circles of a comparable size.
In Figure 4.7, we show the distribution of membership scores for individuals who belong to DNA Circles of given sizes (3 members, 4 members, 5 members, and so on) for more than 66,000 DNA Circles made up of more than 63,000 AncestryDNA members who consented to participate in research (see Section 5). By design (see Eqn. 4.3), as DNA Circles increase in size, the membership scores within those DNA Circles increase. The exception to this trend is for DNA Circles with greater than 100 nodes (see Figure 4.7); this is likely due to the relatively sparse connectivity of circles of this size.
We base the qualitative membership score interpretation on how the score compares to those found in similarly sized DNA Circles. For large scores, the score interpretation is similar across circles of all sizes. For example, 0.9 is in the upper percentile of scores in all DNA Circles. Since a member with this score is more strongly connected to the DNA Circle than other members (see Eqn. 4.3), a member with this score would receive an interpretation of "strong" (as would any scores ≥ 0.7). Similarly, "good" evidence is defined as membership scores ≥ 0.4 and < 0.7.
However, for low scores, we must disentangle whether a member's score is low because the DNA Circle is small, or because the member actually has weak evidence as a member of the DNA Circle. To reflect this possibility, we interpret small scores in smaller DNA Circles as "emerging" evidence, where the exact interpretation of the low score is less clear. "Emerging" evidence also reflects that as a DNA Circle increases in size, a score within it has the potential to increase. In contrast, small scores in larger DNA Circles are interpreted as "weak" or "some" evidence (see below).
To determine the DNA Circle size boundary between these two interpretations, we performed a leave-one-out experiment assessing the stability of DNA Circles of different sizes. We calculate the proportion of DNA Circle members that, when removed, entirely eliminate the DNA Circle (see Table 4.1).
|DNA Circle Size (n) *||Proportion of DNA Circle Members That, When Removed, Eliminate the DNA Circle|
*DNA Circle size of three not shown. Since a DNA Circle has greater than or equal to three members, removing any member from a DNA Circle of size three eliminates the DNA Circle.
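The leave-one-out experiment can be sketched as follows. This is only an outline under stated assumptions: whether a DNA Circle survives the removal of a member depends on the circle construction rules, which are not reproduced here, so the survival check is passed in as a caller-supplied function. The stand-in predicate below captures only the minimum size of three and is hypothetical.

```python
from typing import Callable, List, Sequence


def elimination_proportion(
    members: Sequence[str],
    circle_survives: Callable[[List[str]], bool],
) -> float:
    """Proportion of members whose removal eliminates the DNA Circle."""
    if not members:
        raise ValueError("a DNA Circle must have at least one member")
    eliminating = 0
    for index in range(len(members)):
        # Rebuild the circle without the member at this position.
        remaining = list(members[:index]) + list(members[index + 1:])
        if not circle_survives(remaining):
            eliminating += 1
    return eliminating / len(members)


def has_minimum_size(remaining: List[str]) -> bool:
    """Stand-in survival rule: only the minimum size of three is checked."""
    return len(remaining) >= 3


three_member_result = elimination_proportion(["a", "b", "c"], has_minimum_size)
four_member_result = elimination_proportion(["a", "b", "c", "d"], has_minimum_size)
```

With this stand-in rule, every member of a three-member circle eliminates it when removed, which matches the table footnote. A real survival check would also need to test connectivity, which is why circles of four or more members can still be eliminated in practice.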
Once a DNA Circle reaches eight members, < 1% of members eliminate the DNA Circle when removed, compared to ~30% of members for DNA Circles of size 4. In other words, DNA Circles with eight or more members are relatively stable. This suggests that a low score in a DNA Circle with size n ≥ 8 may actually reflect less evidence for membership to the DNA Circle. In contrast, for the more unstable DNA Circles of size seven or smaller, low scores (< 0.4) could also be because the DNA Circle formed due to chance or because not enough descendants of the ancestor have yet been tested; thus, the score is interpreted as "emerging."
For a concrete example, take a membership score of 0.1 (calculated using Eqn. 4.3). This score in a 30-member DNA Circle is unusually low given that other members of 30-member DNA Circles have larger scores (see Figure 4.7). Since this member is less strongly connected to the circle than expected, this member has less evidence for that DNA Circle and is penalized ("weak" evidence). However, for a score of 0.1 in a smaller DNA Circle, the exact interpretation of the score is less clear, and thus the score is classified as "emerging" evidence.
To determine boundaries between "weak" and "some" evidence for DNA Circles of size 8 or more, we again use the empirical distributions in Figure 4.7 as a guide; we interpret scores < 0.2 as "weak" and scores ≥ 0.2 and < 0.4 as "some" evidence.
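The thresholds in this section can be collected into a single lookup. The sketch below is an illustration of the rules as stated in the text (score cutoffs of 0.2, 0.4, and 0.7, and a size boundary of eight members), not the production implementation; the function name and signature are assumptions.

```python
def interpret_membership_score(score: float, circle_size: int) -> str:
    """Map a membership score and DNA Circle size to a qualitative category."""
    if circle_size < 3:
        raise ValueError("a DNA Circle has at least three members")
    # High scores are interpreted the same way in circles of all sizes.
    if score >= 0.7:
        return "strong"
    if score >= 0.4:
        return "good"
    # Low scores in small, less stable circles are ambiguous.
    if circle_size < 8:
        return "emerging"
    # Low scores in stable circles (eight or more members).
    if score >= 0.2:
        return "some"
    return "weak"


large_circle_label = interpret_membership_score(0.1, 30)
small_circle_label = interpret_membership_score(0.1, 4)
```

This reproduces the worked example above: a score of 0.1 is "weak" in a 30-member DNA Circle but "emerging" in a 4-member one.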
4.2.4. Discussion of Membership Scores
As discussed in Section 2, one of the primary goals of DNA Circles is to provide AncestryDNA members with a supplemental piece of evidence for or against a given ancestor as part of their pedigree. The membership score can be interpreted as the amount of evidence supporting that ancestor as part of a member's pedigree.
Since DNA Circles are continually updated and recalculated (see Section 6), an individual's membership scores to a given DNA Circle may change over time. As a DNA Circle grows, members who have actually inherited DNA from a given ancestor may find that their score interpretation improves (see Section 6). This is not only because the groupscore will increase (see Eqn. 4.2), but also because any given member may share IBD with new AncestryDNA members who are also descendants. We note that a member's score interpretation can in fact decrease if their membership score falls into the lower end of the distribution of membership scores (see Figure 4.7).
As a consequence of using family groups in the construction of DNA Circles (see Section 3.2), individuals who are part of a family group will generally have larger membership scores. By grouping their IBD matches with those of their family members, an individual in a family group has more potential connections to other members of the DNA Circle (as explained in Section 4.1.1). If those family members include relatives along their direct line, those connections also have the potential to be stronger than their own (i.e., larger ei↔j). See Section 5 for further details on the impact of family groups on membership scores.
5. Results: Survey of DNA Circles
To assess the performance of DNA Circle creation, we examined 196,843 genotypes from AncestryDNA members (with non-private pedigrees) who consented to participate in the Ancestry Human Diversity Project. Our goal was to understand the factors affecting construction of DNA Circles, the average sizes of these circles, the effects of family groups on DNA Circle construction, and how DNA Circles may be expected to change over time. In our interpretations, we assume that the survey of these approximately 200,000 randomly selected AncestryDNA members is representative of the present and future AncestryDNA database.
5.1 Effect of Pedigree Size
We consider the effect of direct-line pedigree size on the number of identified MRCAs and the number of DNA Circles for each AncestryDNA member.
Finding MRCAs for a pair of individuals who share DNA inferred to be inherited identical-by-descent is a key component of DNA Circles (see Section 3). Thus, we first examined the frequency of observed MRCAs between individuals sharing IBD.
Of the 196,843 individuals who had a public pedigree associated with their DNA sample, 69,002 individuals (35%) did not have any MRCAs associated with any of their IBD matches; as a result, these individuals will not participate in DNA Circles. In contrast, the remaining 127,840 individuals (65%) had at least one MRCA associated with at least one of their IBD matches, and thus could potentially participate in DNA Circles.
These two groups (those with and without MRCAs) were noticeably different in terms of their direct-line pedigree sizes (see Figure 5.1). For example, the mean direct-line pedigree size (going back nine generations) for the group without MRCAs was 31 (median = 14); however, the mean direct-line pedigree size for members with at least one identified IBD match with an MRCA was 115 (median = 87). Intuitively, this suggests that as members extend their smaller pedigrees, they may begin to receive additional MRCAs associated with their IBD matches.
Given this pattern between pedigree size and having at least one MRCA associated with any one of their IBD matches, we were interested in finding a relationship between direct-line pedigree size and the number of IBD matches with an identified MRCA (see Figure 5.2). This correlation is fairly strong (r² = 0.542). (In this analysis, we included only individuals who have at least one IBD match with an MRCA.) On average, this suggests that an individual could gain one IBD match with an associated MRCA for every three ancestors added to their direct-line pedigree.
These results suggest that individuals with pedigrees that are well vetted (and full) along their direct lines have the potential to join more DNA Circles. On average, we find that for every 50 additional people in a direct-line pedigree, an individual is a member of one more DNA Circle (see Figure 5.3). While this relationship does not necessarily imply that adding 50 new people to one's pedigree will result in a new DNA Circle, it suggests that larger direct-line pedigrees may enable one to participate in more DNA Circles.
5.2 Effect of Family Groups
Given that they are another building block of DNA Circles (see Section 3.2), we also assessed the impact of family group membership on DNA Circles. Out of the 196,843 AncestryDNA members of this study, 39,203 unique individuals (~20%) were a part of 31,435 family groups. While most individuals belonged to only one family group, some belonged to many more. The most common family groups were groups of siblings, or of parents and children. Of all family groups, 79% contained only two living members.
Individuals with at least one family group have more DNA Circles than those with no family groups. While the mean number of DNA Circles for all members without a family group is 2.39, the mean for those with at least one family group is 3.32 (p-value < 2e-16). In other words, individuals with at least one family group have on average one additional ancestor in their pedigree with an associated DNA Circle. As discussed in Sections 3.2 and 4.1.1, this is because family groups aggregate the IBD matches of all family group members.
We also find that across all DNA Circles, individuals within a family group have on average more edges within a DNA Circle than those not in family groups (0.3 more edges, p-value < 2e-16; see Section 4.1.1). Recall also that the membership scores for family group members involve the strongest edge scores of the family group (ei↔j; see Section 4.1.1). As a result, members within family groups have stronger membership scores to their DNA Circles than those without family groups (0.02 points higher, p-value < 2e-16; see also Sections 4.2.2 and 4.2.4).
5.3 General Properties of DNA Circles
In our sample of 196,843 AncestryDNA members, we identified a total of 66,184 DNA Circles involving 63,965 unique individuals (32% of all the members in our sample). Note that the theoretical maximum proportion of individuals who could be members of a DNA Circle was 65% (those with IBD matches with corresponding MRCAs; see Section 5.1). Among the 63,965 members with at least one DNA Circle, the average number of DNA Circles per person was 5.1.
Most (62%) DNA Circles contained only three nodes, the minimum number of nodes for a DNA Circle. (Note that some of these DNA Circles may contain more individual members if a family group is one of the three nodes.) A total of 1,823 DNA Circles contained more than 10 members; these DNA Circles involved 18,399 unique individuals. In other words, 9% of the AncestryDNA members part of this study were involved in a DNA Circle of more than 10 members.
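The proportions quoted in Sections 5.1 and 5.3 follow directly from the reported counts. The short check below uses only figures stated in the text; the variable names are illustrative.

```python
SAMPLE_SIZE = 196_843          # members in the survey
WITH_MRCA = 127_840            # members with at least one MRCA (Section 5.1)
IN_ANY_CIRCLE = 63_965         # members in at least one DNA Circle
IN_LARGE_CIRCLE = 18_399       # members in a circle of more than 10 members


def percent_of_sample(count: int) -> float:
    """Express a member count as a percentage of the survey sample."""
    return 100.0 * count / SAMPLE_SIZE


pct_with_mrca = percent_of_sample(WITH_MRCA)
pct_in_any_circle = percent_of_sample(IN_ANY_CIRCLE)
pct_in_large_circle = percent_of_sample(IN_LARGE_CIRCLE)

print(f"{pct_with_mrca:.1f}% could potentially join a DNA Circle")
print(f"{pct_in_any_circle:.1f}% belong to at least one DNA Circle")
print(f"{pct_in_large_circle:.1f}% belong to a DNA Circle of more than 10 members")
```

Rounded to whole percentages, these give the 65%, 32%, and 9% figures reported in the text.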
See Figure 5.4 for a distribution of DNA Circle sizes in this dataset. At the time of this study, a majority of identified DNA Circles are of a small size; see Section 4.2.3 for a discussion of the interpretation of membership scores in small DNA Circles.
To understand the growth in the number of DNA Circles with respect to database size, we recalculated DNA Circles from subsamples of the 196,843 customers in increments of 25,000. The growth in the number of total DNA Circles starts off slowly, and appears to grow at a constant rate from ~75,000 to 175,000 members (approximately 10,000 DNA Circles are added for every 25,000 customers) (see Figure 5.5). While the projection of this growth to even more AncestryDNA members is unclear, we speculate that this growth may slow in the distant future: instead of the raw number of circles growing, the number of individuals in those circles may grow. These rates of growth will depend on database composition.
In summary, we have discovered several informative trends from this initial study of DNA Circles. In general, members with more complete direct-line 10-generation pedigrees and those who are members of family groups tend to belong to more DNA Circles, as well as have higher membership scores in those circles. As the AncestryDNA database continues to grow, we expect a greater number of DNA Circles, as well as larger DNA Circles, in the database at large.
6. Discussion and Future Refinements
In this document, we have discussed a conservative approach to identify and group AncestryDNA members who are likely to be descendants of a particular ancestor. DNA Circles, in addition to organizing member matches and connecting individuals with distant cousins, may help to provide a supplemental piece of evidence for a particular ancestor as part of one's pedigree. Here, we discuss several caveats of the DNA Circle analysis as well as the future of DNA Circles.
First, we acknowledge that several of the calculations forming the basis for DNA Circles are based on ad-hoc principles. Nonetheless, results of extensive testing (see Section 5) have demonstrated the utility of DNA Circles and validated their performance.
Second, AncestryDNA member pedigrees and IBD estimation between AncestryDNA members are key components of DNA Circles. While any inaccuracies in member pedigrees may be made less problematic by the aggregation of information across many pedigrees, pedigree inaccuracies could still result in spurious connections in DNA Circles. In addition, using IBD estimation to identify pairs of individuals who are likely relatives is a challenging scientific problem. As IBD estimation at AncestryDNA improves over time (see below), there may be modifications made to the connections forming the basis of DNA Circles.
6.2. Changes in DNA Circles Over Time
Each DNA Circle is based on a current snapshot of the AncestryDNA member database and their respective pedigrees. As the AncestryDNA community grows, and as AncestryDNA member pedigrees change over time, DNA Circles will also evolve (see Section 5). We continuously update DNA Circles so that each one is calculated from the most current data available.
6.3. Scientific Improvements
Most importantly, the AncestryDNA science team is hard at work not only on the science behind DNA Circles, but also on the science behind identifying DNA that is identical by descent (IBD) among individuals. As algorithms behind both of these features are improved, we expect that future changes and modifications to DNA Circles will allow AncestryDNA members to make even more varied discoveries about their family histories.