First, the gedcom file generated from a database before merging sources and citations with the RM8 tools is the size I would expect, not 50 times that size.
I made a three-person tree in Ancestry and used TreeShare to bring it into RootsMagic, to try to trace what Merge all duplicate sources and Merge all duplicate citations do to the database and, in turn, to the gedcom file.
Merge all duplicate sources found two (of ten) sources that it judged to be duplicates of others, and correctly so. The mystery is how TreeShare came to treat them as different sources in the first place. Both duplicates are connected to the same marriage event, which is a family event, so OwnerType 1 in the EventTable. The two sources involved are a census and a marriage record. Each has multiple citations associated with it: all of the OwnerType 0 (individual) citations are associated with one instance of the source, while the one OwnerType 1 (family) citation is associated with an identical but separate instance of the source.
Merge all duplicate sources recognizes the two instances as the same, but doesn’t merge them completely. The two instances keep separate though identical addresses (really repositories) in the AddressTable: Ancestry.com. They also have the same media, downloaded under different filenames in the “path/file1” and “path/file1 (file 1)” convention.
The effects I see in the RM database are these: the Source table now has eight entries instead of ten; the two OwnerIDs for the duplicates in the AddressLink table have been repointed to the two surviving sources, so there are eight distinct OwnerIDs (the eight sources) instead of ten; and the last two citations in the CitationsTable have likewise changed SourceID, so that there are eight rather than ten distinct SourceIDs in that table.
What does this do to the gedcom file? It’s a little smaller, because there are two fewer sources. The two sources that had duplicates now have two (identical) repositories attached to them, but that amounts to one line each in the gedcom. The files are otherwise the same except that the two duplicate sources appear under their new names: S9 is now S1 and S10 is now S4.
Merge all duplicate citations uses a similar approach but results in much larger differences. The count of citations in the CitationsTable shrinks from 37 to 12. Why 12? It appears that a citation is kept as unique only if it has a distinct combination of CitationName and SourceID. For my simple database, that’s enough: one citation now corresponds to one item of the associated media, not including duplicates of the form “path/file1 (file 1)”, since the MediaLinkTable also decreased, from 30 items to 14. The CitationLinksTable still has 37 entries, all pointing to the correct CitationID, and the URLTable, which functions somewhat like the MediaLinkTable, also shrank to the correct number of _WEBTAGS (2).
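For anyone who wants to experiment with this rule, here is a minimal sketch of what the merge appears to be doing, based only on the behavior described above. The `citation` and `citation_link` tables and their contents are simplified stand-ins invented for illustration, not the real RootsMagic schema, and the rule itself (one surviving citation per distinct SourceID and CitationName pair) is my inference, not documented behavior.

```python
import sqlite3

# Toy in-memory database. Table names and rows are hypothetical stand-ins,
# not the actual RootsMagic tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE citation (CitationID INTEGER PRIMARY KEY, SourceID INTEGER, CitationName TEXT);
CREATE TABLE citation_link (LinkID INTEGER PRIMARY KEY, CitationID INTEGER);
INSERT INTO citation VALUES
  (1, 1, '1900 census'), (2, 1, '1900 census'), (3, 1, '1910 census'),
  (4, 2, ''), (5, 2, '');
INSERT INTO citation_link (CitationID) VALUES (1), (2), (3), (4), (5);
""")

def merge_duplicate_citations(con):
    """Assumed rule: keep the lowest CitationID per (SourceID, CitationName)."""
    con.execute("""
        CREATE TEMP TABLE remap AS
        SELECT c.CitationID AS old_id,
               (SELECT MIN(k.CitationID) FROM citation k
                 WHERE k.SourceID = c.SourceID
                   AND k.CitationName = c.CitationName) AS new_id
        FROM citation c
    """)
    # Repoint every use of a citation at the surviving copy.
    con.execute("""
        UPDATE citation_link
        SET CitationID = (SELECT new_id FROM remap
                          WHERE old_id = citation_link.CitationID)
    """)
    # Drop the citations that no longer survive.
    con.execute("DELETE FROM citation WHERE CitationID NOT IN (SELECT new_id FROM remap)")
    con.commit()

merge_duplicate_citations(con)
citations_left = con.execute("SELECT COUNT(*) FROM citation").fetchone()[0]
links_left = con.execute("SELECT COUNT(*) FROM citation_link").fetchone()[0]
print(citations_left, links_left)
```

In this toy data the two unnamed citations on source 2 collapse into one even though they might describe different records. The number of links stays the same while the number of citations drops, which matches the 37 links and 12 citations seen above.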
So what’s the problem? The problem is what owns the MediaLinks and URLs, and whether that owner has more than one of them. The OwnerType in each case is 4. I don’t know what that is, exactly, but I think it’s a CitationID. If the citation has one media item or URL associated with it, all is fine. If, however, it has two, it will list those two every time it’s used. If it has 100, it will list those 100 every time it’s used, which presumably is at least 100 times, both in the RootsMagic interface and in any gedcoms produced. The gedcom for my little three-person tree increased 10% after merging sources and citations, and as mentioned previously, my main database increased by 5000%.
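The blow-up is multiplicative, which a small sketch can make concrete. The tables below are again simplified, hypothetical stand-ins rather than the real RootsMagic schema, and the query assumes, as guessed above, that OwnerType 4 means the media link is owned by a citation.

```python
import sqlite3

# Toy in-memory database with invented rows, not the real RootsMagic layout.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE citation_link (LinkID INTEGER PRIMARY KEY, CitationID INTEGER);
CREATE TABLE media_link (LinkID INTEGER PRIMARY KEY, MediaID INTEGER,
                         OwnerType INTEGER, OwnerID INTEGER);
""")
# Citation 1 is the survivor of three merged citations: it is used by three
# facts and has inherited all three media items. Citation 2 is used once.
con.executemany("INSERT INTO citation_link (CitationID) VALUES (?)",
                [(1,), (1,), (1,), (2,)])
con.executemany(
    "INSERT INTO media_link (MediaID, OwnerType, OwnerID) VALUES (?, 4, ?)",
    [(10, 1), (11, 1), (12, 1), (20, 2)],
)

# For each citation: number of uses, number of media items, and how many
# media entries get listed in total (every use repeats every media item).
rows = con.execute("""
    SELECT u.CitationID, u.uses, m.media, u.uses * m.media AS listed
    FROM (SELECT CitationID, COUNT(*) AS uses
            FROM citation_link GROUP BY CitationID) AS u
    JOIN (SELECT OwnerID, COUNT(*) AS media
            FROM media_link WHERE OwnerType = 4 GROUP BY OwnerID) AS m
      ON m.OwnerID = u.CitationID
    ORDER BY u.CitationID
""").fetchall()

total_listed = sum(r[3] for r in rows)
for citation_id, uses, media, listed in rows:
    print(citation_id, uses, media, listed)
print("total media entries listed:", total_listed)
```

Before the merge, each of the four uses would have carried one media item, for four entries in total. After it, citation 1 alone accounts for nine. A citation used N times with N media items yields N × N entries, which is how a large tree could plausibly grow by orders of magnitude.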
Here’s where I have to stop and get some feedback. This might be considered proper behavior by a developer. After all, if you’re a proper (meticulous) genealogist, you can go through your citations after using TreeShare and make sure that any citation with more than one media item is split up so that each has a unique name and one media item; in that case I think Merge all duplicate sources and Merge all duplicate citations would work fine. If, on the other hand, you’re me and all the findagrave citations don’t have a name but have unique URLs, your gedcoms will get big and you won’t be able to tell which media or URL goes with the instance of the citation you’re looking at. There might be an SQL solution for getting to one media item per citation after the fact, but I understand SQL like I understand my kids talking about pop culture: I can repeat it and kind of understand it, but I wouldn’t want to try making any up.