To address the problem of sparseness, Driskell et al. built a "supertree". The researchers assembled individual gene clusters into subtrees and then looked for enough taxonomic overlap among them to allow construction of a supertree. For example, using 254 genes (2,777 sequences and 96,584 sites), the authors reduced the green plant supermatrix from 16,000 taxa to 69 taxa, with an average of 40 genes per taxon and 84% of sequences missing. This supertree represents one of the largest data sets for phylogeny estimation in terms of total nucleotide information, yet it is the sparsest in terms of the percentage of overlapping data.
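The sparseness figures quoted above can be checked with simple arithmetic. The sketch below is only an illustration, not the authors' method: it assumes that "missing" means empty taxon-by-gene cells in a 69 × 254 matrix, and the function name `matrix_sparseness` is hypothetical.

```python
def matrix_sparseness(n_taxa, n_genes, n_sequences):
    """Return (fraction of empty cells, mean genes per taxon).

    Assumes each sequence fills exactly one taxon-by-gene cell,
    which is an interpretation of the figures, not a stated fact.
    """
    total_cells = n_taxa * n_genes            # every possible taxon/gene pair
    missing_fraction = 1 - n_sequences / total_cells
    genes_per_taxon = n_sequences / n_taxa    # average number of filled cells per row
    return missing_fraction, genes_per_taxon


# Figures quoted for the green plant data set: 69 taxa, 254 genes, 2,777 sequences
missing, per_taxon = matrix_sparseness(n_taxa=69, n_genes=254, n_sequences=2777)
print(f"{missing:.0%} missing, {per_taxon:.0f} genes per taxon")
# prints "84% missing, 40 genes per taxon"
```

Under that assumption, the three counts reproduce both the 84% and the 40-genes-per-taxon figures, so the quoted numbers are internally consistent.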
Despite the sparseness of this supertree, the authors were still able to estimate robust phylogenetic relationships that are congruent with those reported using more traditional methods. Recent computer simulation studies showed that, contrary to the prevailing view, phylogenetic accuracy depends more on having sufficient characters (such as amino acids) than on whether data are missing. Building a supertree therefore provides an abundance of characters even though the resulting matrix has many missing entries. Adapted from here.