Gedcom bloat in RM8 due to citation webtags

I have been using RM8 for a while with no huge issues. However, when I exported my first gedcom, it was approximately 50 times the size of my previous (RM7) gedcom files. This is due mainly to every source including every citation for every individual, and in particular their webtags and URLs. For instance, if SOUR @S7@ is the 1860 US Census, every _WEBTAG and URL pair for every citation is listed under every individual who has that source, when I would expect (at most) just the webtag and URL for the particular individual. Though maybe gedcom doesn’t know the parent (individual) of its parent (source). Media behaves similarly–I get all media for all citations of a source for any individual or family who uses the source.

I would expect this to be a common problem, but after searching this forum and sqlitetoolsforrootsmagic, either it’s not a common problem or I’m not recognizing how it’s described by others. I think it must be some sort of misconfiguration on my part. The result is that when I import that 50X gedcom into TNG for display, individuals have a lot of media associated with them that shouldn’t be there, specifically all media for all citations for any source associated with them. Any help is appreciated.

After importing your RM7 database into RM8, did you run the Merge Duplicate Sources tool and the Merge Duplicate Citations tool? I suspect that running these tools will make a big improvement. The Merge Duplicate Citations tool is the one that is likely to make the most impact.

These tools may be found via Sources => THREE DOTS.

That is more likely the cause of the problem!

As I think about it, I suspect Tom is correct about that. The issue is that reusable citations are not really supported by GEDCOM so RM8 has to break the reusable citations that have been merged together back into separate citations when it makes a GEDCOM. Therefore there is no space to be saved in the GEDCOM file due to reusable citations.

That being said, I just made GEDCOM from my RM7 database and my equivalent RM8 database - 77,266KB vs. 78,302KB. I don’t use WebTags, but I have a lot of media. Maybe WebTags are creating more of a bloat problem than media files?

Most of my media is tagged to citations and I memorize and paste a given citation into lots of different places, especially for things like obituaries. The media tags for those citations hence are replicated many times.

I don’t use TNG, but I do use GedSite. All those citations I have memorized and pasted are obviously replicated properly into GedSite via GEDCOM, but GedSite stores each distinct media file only a single time and just links to a given media file as many times as I have citations to the media file. Are you saying that TNG replicates media files?

Akismet put me in timeout and then forgot about me, I think for linking twice in two posts to sqlitetools for RootsMagic. I will try to recreate my response without the offending link.

I think this is on the right track, up to a point. Most of my media is also tagged to citations, and most of the excess in the gedcom is due to webtags rather than media. I did Merge Duplicate Sources and then with my fingers crossed also used Merge Duplicate Citations. But I would have said after that everything was fine in RootsMagic–the only problem is in the gedcom file. I ran the DeleteDuplicateCitationLinks-RM8.sql script from the aforementioned but not linked sqlitetools for RootsMagic site, except using a SELECT rather than a DELETE, and it came back empty. That is the proper behavior for an RM8 database that’s working correctly, unless I’m misunderstanding something.

What happens in the gedcom file, though, is if a source has 100 distinct citations, for instance 100 different pages in a census, and 100 individuals use that source, one citation each, I don’t get one citation (and associated media link and webtag and URL) for each individual, I get all 100. Where I would expect one source with one media link, webtag, and URL for each person, I get one source with 100 media links, webtags, and URLs, so 10,000 entries instead of 100 for each source, for all sources, so it adds up quickly. TNG is doing the right thing associating all the media from all 100 citations to one source for one person, because that’s what the RM gedcom told it.
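
To put numbers on that, here is a rough back-of-the-envelope sketch in Python. The function name and the one-link-per-citation simplification are mine, purely for illustration; it only restates the arithmetic above and says nothing about how RM builds the file internally.

```python
def link_counts(citations_per_source, people_per_source):
    """Return (expected, observed) media/webtag link counts for one source."""
    # Expected: each person uses one citation, so one link per person.
    expected = people_per_source
    # Observed in the bloated gedcom: each person carries every citation's link.
    observed = people_per_source * citations_per_source
    return expected, observed


expected, observed = link_counts(citations_per_source=100, people_per_source=100)
print(expected, observed, observed // expected)  # 100 10000 100
```

The blow-up factor per source equals the number of citations under that source, so sources with many citations (a census, findagrave) dominate the file size.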

Either it’s the issue @thejerrybryan identified with GEDCOM not dealing with reusable citations, in which case everyone should have essentially the same problem I do, or somehow I’ve told RM to associate all citations with all uses of a source, in which case it seems like all the duplicate media should show up in RM8 itself, which it doesn’t. Or option three I can’t think of, which is why I posted.

Thanks for your help.

I think it is a problem in the export of reused citations created by the “Merge all duplicate citations” function which, as I’ve warned in the past, can merge citations that have differences in Media and WebTags. And maybe it is a general problem with reused citations, whether created manually (Paste reuse) or by that function. I just did a wee test:
DB1: a TreeShare download with no merging.
DB2: above after Merge all dup sources and then Merge all dup citations
DB3: import of GEDCOM exported from DB2

Here are the salient database properties:

                  DB1   DB2   DB3
Sources            74    74    74
Citations         631   123   631
Media             105   105   105
MediaLinks        485   113   709

DB3 should be the same as DB1 if all conversions worked correctly; something is amiss with Media Links.

Web Tags are not reported by RM but a quick look at the URLTable using SQLite shows that they mushroomed from 30 in DB1 to 364 in DB3; something amiss there, too!
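
For anyone who wants to repeat that kind of count, here is a minimal sketch using Python's built-in sqlite3 module. It assumes you point it at a copy of the database (never the live file) and that the table names you pass in exist; the demo table and its columns below are made up so the snippet runs on its own.

```python
import sqlite3


def table_counts(conn, tables):
    """Return {table_name: row_count} for each named table."""
    counts = {}
    for name in tables:
        # Table names cannot be bound as parameters, so quote them explicitly.
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
    return counts


# Stand-in database with invented columns, only to show the call.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE URLTable (LinkID INTEGER PRIMARY KEY, URL TEXT)")
demo.executemany(
    "INSERT INTO URLTable (URL) VALUES (?)",
    [("https://example.com/a",), ("https://example.com/b",)],
)
print(table_counts(demo, ["URLTable"]))  # {'URLTable': 2}
```

Running the same function against copies of DB1, DB2 and DB3 would give the rows of a comparison like the one above.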

So @randalrh, you’re not alone. Maybe you’re the only one to notice!

I think Tom has pretty much identified the problem. It’s curious that the WebTags seem to mushroom more rapidly than do Media tags.

I think you need to submit a trouble ticket with your database to the RM HelpDesk.

I would suggest you revert to the version just before you did the mass merging and export from it.

Also, do some comparisons between that version and the post-merge within RM to confirm that everything is alright. I’m betting it is not and that you will find citations with multiple media and multiple web tags that should have been kept as individual citations with singular media and web tags. That’s due to the RM function failing to take into account differences between each citation’s media and web tags when looking for duplicates. If the user has not established differentiation in the Citation Name or any other Citation fields, the function declares them duplicate, despite having a different media or web tag. Citations downloaded via TreeShare from Ancestry are notorious for lacking differentiation where it matters for the RM function to detect.
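
A hedged sketch of how one might look for such at-risk citations before merging. It runs against toy tables built in memory, not a real RM8 file; the table and column names are modelled on what I believe the RM8 schema looks like and should be checked against your own database. It also assumes that media links owned by a citation carry OwnerType 4, and it uses SourceID plus CitationName as a simplified stand-in for whatever fields the merge function really compares.

```python
import sqlite3


def risky_merge_groups(conn):
    """Groups of citations that look alike by name but carry different media."""
    sql = """
        SELECT c.SourceID,
               c.CitationName,
               COUNT(DISTINCT c.CitationID) AS citations,
               COUNT(DISTINCT m.MediaID)    AS media_items
        FROM CitationTable AS c
        LEFT JOIN MediaLinkTable AS m
               ON m.OwnerType = 4 AND m.OwnerID = c.CitationID
        GROUP BY c.SourceID, c.CitationName
        HAVING COUNT(DISTINCT c.CitationID) > 1
           AND COUNT(DISTINCT m.MediaID) > 1
        ORDER BY c.SourceID, c.CitationName
    """
    return conn.execute(sql).fetchall()


# Toy data: two unnamed citations with different media (risky), two
# 'p. 5' citations sharing one media item (true duplicates), one loner.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CitationTable (
        CitationID INTEGER PRIMARY KEY, SourceID INTEGER, CitationName TEXT);
    CREATE TABLE MediaLinkTable (
        LinkID INTEGER PRIMARY KEY, MediaID INTEGER,
        OwnerType INTEGER, OwnerID INTEGER);
    INSERT INTO CitationTable VALUES
        (1, 7, ''), (2, 7, ''), (3, 7, 'p. 5'), (4, 7, 'p. 5'), (5, 8, 'x');
    INSERT INTO MediaLinkTable (MediaID, OwnerType, OwnerID) VALUES
        (10, 4, 1), (11, 4, 2), (12, 4, 3), (12, 4, 4), (13, 4, 5);
""")
print(risky_merge_groups(conn))  # [(7, '', 2, 2)]
```

Any group this reports is one where a blind merge would leave a single citation holding several media items. The same shape of query could be pointed at the web tags table.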

I believe this is a significant error in design by the developers and they have let loose a bulk modification feature that traps users into irreversible corruption of their databases. How many times have we heard RM Inc say they would not develop some bulk change feature because users might get into trouble if they misuse it? This is another example of RM adding something to the features list that has not been thoroughly developed and tested. It’s a change in an area with many relationships and dependencies and they have overlooked two of them.

What I don’t know and have not tested for is whether there is a flaw in the GEDCOM export of reused citations, setting aside the faulty merges. Probably not, assuming that the developers tested the round trip with ideal data until they got that much right.

The best test would be the following:

  • Export a GEDCOM from RM7.
  • Import the same RM7 database into RM8.
  • Merge duplicate sources and duplicate citations in RM8.
  • Export a GEDCOM from RM8.
  • Compare the GEDCOM from RM7 with the GEDCOM from RM8.

The compare will obviously not be a 100% match, but the citations, media files, and webtags should pretty much be the same between the two GEDCOMs.

A variation on this theme would be the following.

  • Import an RM7 database into RM8.
  • Export a GEDCOM from RM8.
  • Merge duplicate sources and duplicate citations in RM8.
  • Export a second GEDCOM from RM8.
  • Compare the two GEDCOMs from RM8 with each other.

Again, the compare will obviously not be a 100% match, but the citations, media files, and webtags should pretty much be the same between the two GEDCOMs.
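
Since a line-by-line diff of two large GEDCOMs is hard to read, a cruder comparison that just counts tags may be enough to spot the bloat. Below is a small Python sketch of that idea. OBJE is the standard GEDCOM multimedia tag and _WEBTAG is the tag mentioned earlier in this thread; the two sample snippets are invented for the demonstration and are not real RM output.

```python
from collections import Counter


def count_tags(gedcom_text, tags=("SOUR", "OBJE", "_WEBTAG")):
    """Count how many lines carry each tag of interest."""
    counts = Counter()
    for line in gedcom_text.splitlines():
        parts = line.strip().split(maxsplit=2)
        if len(parts) < 2:
            continue
        # Record lines look like "0 @S7@ SOUR"; others are "LEVEL TAG [value]".
        if parts[1].startswith("@") and len(parts) > 2:
            tag = parts[2].split()[0]
        else:
            tag = parts[1]
        if tag in tags:
            counts[tag] += 1
    return {t: counts[t] for t in tags}


# Invented samples: the second one repeats the webtag under the same citation.
before = "0 @I1@ INDI\n1 BIRT\n2 SOUR @S7@\n3 _WEBTAG\n4 URL https://example.com/p1\n0 @S7@ SOUR\n"
after = "0 @I1@ INDI\n1 BIRT\n2 SOUR @S7@\n3 _WEBTAG\n4 URL https://example.com/p1\n3 _WEBTAG\n4 URL https://example.com/p2\n0 @S7@ SOUR\n"

print(count_tags(before))  # {'SOUR': 2, 'OBJE': 0, '_WEBTAG': 1}
print(count_tags(after))   # {'SOUR': 2, 'OBJE': 0, '_WEBTAG': 2}
```

For real files you would read each export into a string first and pass it in. If the SOUR counts match but OBJE or _WEBTAG jump between the two exports, that points at the merge rather than at the export in general.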

As Tom indicated, problems with citations, media files, and webtags would be most likely to show up on RM databases downloaded from Ancestry using TreeShare and where some of the Ancestry sources were from collections that were distinguished only by the media file.

Thank you both for your analysis, reproducer, and test case suggestions. I will do some testing this weekend and report back here what if anything I figure out. I will also likely wind up submitting a support ticket.

Not directly related to your problem but I too am exporting to TNG and have a question about how you’re getting all the media into TNG as it’s not in the GEDCOM. Via admin seems cumbersome.

The media is not in the GEDCOM but links to the media are in the GEDCOM. I have played around a very little bit with TNG but it was a very long time ago. Does TNG not process the media links that it finds in GEDCOM?

Yes, it processes the media links, but the media is on my local computer in my RM media folder so the links are there but they are all pointing to a media folder on the server that’s empty. I’m wondering what’s the best way to populate that folder as the media manager in TNG only accepts a few files at a time. I’m assuming I backdoor it with FTP but I was wondering how another RM user was managing their RM → TNG activity.

You would probably get a better, more expert, answer on the TNG forum rather than here, but I use Filezilla (which, at least when I started using it, was the recommendation of the TNG developer) to move media in bulk from my local computer to the server where I run TNG.

I don’t know what that means. You might be using a mechanism I’ve never tried. TNG doesn’t need to move the media, it just needs to point at it correctly once it arrives.

I found a video from TNG that specifically mentioned RM (and FileZilla) that mostly solved my problem. My remaining issue seems to be specific to the provider that is hosting my TNG sites. Turns out the media manager wasn’t intended for bulk uploads, and FTP is the only way to do it.

First, the gedcom file generated from a database before merging sources and citations using the RM8 tools is of an expected size, not 50 times that size.

I made a three-person tree in Ancestry and used TreeShare to bring it into RootsMagic to try to trace what Merge all duplicate sources and Merge all duplicate citations do to the database, and then what that does to the gedcom file.

Merge all duplicate sources found two (of ten) sources it thought were the same, and correctly so. The mystery is how they came to be considered different sources by TreeShare in the first place. They both are connected to the same marriage event, a family event, so OwnerType 1 in the EventTable. The two sources are a census and a marriage record. Both have multiple citations associated with them–all of the OwnerType 0 (individual) citations are associated with one instance of the source, while the one OwnerType 1 (family) citation is associated with an identical but separate instance of the source. Merge all duplicate sources recognizes them as the same, but doesn’t merge them completely. The two sources maintain a different though identical address (really repository) in the AddressTable: Ancestry.com. They also have the same media, downloaded under a different filename in the “path/file1” and “path/file1 (file 1)” convention. The effects I see in the RM database are that the Source table now has eight entries instead of ten, the two OwnerIDs for the duplicates in the AddressLink table point to two other entries, so there are eight OwnerIDs (the eight sources) instead of ten, and the last two citations in the CitationsTable have likewise changed SourceID, so that there are eight rather than ten SourceIDs in that table.

What does this do to the gedcom file? It’s a little smaller, because there are two fewer sources. The two sources that had duplicates now have two (identical) repositories attached to them, but that amounts to one line each in the gedcom. The files are otherwise the same except that the two duplicate sources appear under their new names: S9 is now S1 and S10 is now S4.

Merge all duplicate citations uses a similar approach but results in much larger differences. The count of citations in the CitationsTable shrinks from 37 to 12. Why 12? It appears to be the case that a CitationID is unique if it has a distinct CitationName and SourceID. For my simple database, that’s enough–one citation now correlates with one item of the associated media, not including duplicates of the form “path/file1 (file 1)”, since the MediaLinkTable also decreased, from 30 items to 14. The CitationLinksTable still has 37 entries, all pointing to the correct CitationID, and the URLTable, which functions kind of like the MediaLinkTable, also shrank to the correct number of _WEBTAGS (2).

So what’s the problem? The problem is who owns the MediaLinks and URLs and if it owns more than one. The OwnerType in each case is 4. I don’t know what that is, exactly, but I think it’s a CitationID. If the CitationID has one media item or URL associated with it, all is fine. If, however, it has two, it will list those two every time it’s used. If it has 100, it will list those 100 every time it’s used, which presumably is at least 100 times, both in the RootsMagic interface and in any gedcoms produced. The gedcom for my little three person tree increased 10% after merging sources and citations, and as mentioned previously, my main database increased by 5000%.
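
For what it's worth, here is a rough sketch of how the affected citations might be listed, written in Python with the built-in sqlite3 module. It runs on toy in-memory tables rather than a real RM8 file. The OwnerType and OwnerID columns and the meaning of OwnerType 4 are assumptions taken from the description above, and the other columns are invented, so check everything against a copy of your own database before trusting it.

```python
import sqlite3

LINK_TABLES = ("MediaLinkTable", "URLTable")


def multi_link_citations(conn, table):
    """List (CitationID, link_count) for citations owning more than one link."""
    if table not in LINK_TABLES:
        raise ValueError(f"unexpected table name: {table}")
    sql = f"""
        SELECT OwnerID, COUNT(*)
        FROM {table}
        WHERE OwnerType = 4          -- assumed: 4 means 'owned by a citation'
        GROUP BY OwnerID
        HAVING COUNT(*) > 1
        ORDER BY COUNT(*) DESC, OwnerID
    """
    return conn.execute(sql).fetchall()


# Toy data: citation 1 owns two media items and three URLs; citation 2 owns
# one of each; one media link belongs to a source (OwnerType 3) instead.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE MediaLinkTable (
        LinkID INTEGER PRIMARY KEY, MediaID INTEGER,
        OwnerType INTEGER, OwnerID INTEGER);
    CREATE TABLE URLTable (
        LinkID INTEGER PRIMARY KEY, OwnerType INTEGER,
        OwnerID INTEGER, URL TEXT);
    INSERT INTO MediaLinkTable (MediaID, OwnerType, OwnerID) VALUES
        (10, 4, 1), (11, 4, 1), (12, 4, 2), (13, 3, 1);
    INSERT INTO URLTable (OwnerType, OwnerID, URL) VALUES
        (4, 1, 'https://example.com/a'),
        (4, 1, 'https://example.com/b'),
        (4, 1, 'https://example.com/c'),
        (4, 2, 'https://example.com/d');
""")
print(multi_link_citations(conn, "MediaLinkTable"))  # [(1, 2)]
print(multi_link_citations(conn, "URLTable"))        # [(1, 3)]
```

Every citation this lists will repeat all of its media or URLs each time it is used, which is where the multiplication in the gedcom comes from.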

Here’s where I have to stop and get some feedback. This might be considered proper behavior by a developer–after all, if you’re a proper (meticulous) genealogist, you can make sure all your citations having more than one media item have unique names and one media item each after you use TreeShare–I think Merge all duplicate sources and Merge all duplicate citations would work fine. If, on the other hand, you’re me and all the findagrave citations don’t have a name but have unique URLs, your gedcoms will get big and you won’t be able to tell which media or URL goes with the instance of the citation you’re looking at. There might be an SQL solution for getting to one media item per citation after the fact, but I understand SQL like I understand my kids talking about pop culture–I can repeat it and kind of understand it, but I wouldn’t want to try making any up.

I just read over the thread, and it looks like I just confirmed what @TomH had already said, only using more words.

I can’t explain every detail of this discrepancy to my satisfaction, but the root cause is that Ancestry doesn’t have family events (AKA couple events). Instead, marriage is an individual event in Ancestry: each individual has their own separate copy of a marriage event, and each separate copy of the marriage event has its own separate copy of the sources. As I said, I really don’t understand the details of what happens in TreeShare when marriage events are transferred in either direction between RM and Ancestry.

One question I have never explored is whether Ancestry has the same problem with GEDCOM, because GEDCOM treats Marriage as a couple event just like RM does. I have never looked at what happens if data is transferred to and from Ancestry via GEDCOM. If Marriage is transferred cleanly to and from Ancestry via GEDCOM (and it may not be), then I would wonder if TreeShare might be able to mimic the GEDCOM transfer somehow or other.

In any case, I’m sure that in the case you cite, TreeShare somehow or other “merged” the duplicate Marriage events coming in from Ancestry without also “merging” the sources for the duplicate Marriage events.

Yes, that’s correct. Also, an OwnerType of 3 is for a SourceID in RM. But “sources” in Ancestry become “citations” in RM and vice versa. The terminology surrounding sources and citations and footnote sentences is profoundly inconsistent.

My view is that OwnerType 1 (Family) citations are pretty worthless, both in RM and in GEDCOM.

Suppose you have a family consisting of John Doe who married Jane Smith and their two children William Doe and Sarah Doe.

You can enter Family citations on the Parents line in the Edit Person screen for either William or Sarah or both. You can also enter Family citations on the Spouse line for either John or Jane or both. So let’s suppose you have a citation for a Birth Certificate for Sarah Doe on her Parents line that says that her parents are John Doe and Jane Smith. The same citation will immediately show up on the Spouse line in the Edit Person screen for both John and Jane, and it will even show up immediately in the Parents line in the Edit Person screen for William Doe.

In my view, this is utter madness. But let me give an even better example. Suppose John Doe and Jane Smith were married in 1849 and were enumerated together in the 1850 census. It might be logical to have a Census (Family) fact for them for the 1850 census and to use the 1850 census as a citation for the Census (Family) fact. Then suppose William Doe was born in 1852 and Sarah Doe was born in 1855. As soon as you enter William and Sarah into RM, the citation for John and Jane’s 1850 Census (Family) shows up in the Parents line in the Edit Person screen for both William and Sarah, even though neither one of them was even born in 1850.

In my view, this is utter, utter madness. And it really isn’t even RM’s fault. That’s because RM is faithfully following the lineage linked data model followed by GEDCOM and also by most other genealogy software.

And wait! There’s even more! Let’s run a narrative report for Sarah. Her birth certificate citation does not appear in her report. So let’s run a narrative report for the whole family. Sarah’s birth certificate citation does appear in the report. But it appears only once, and that one appearance is in the list of family facts for John and Jane such as Marriage and Divorce. The citation is still not connected to Sarah in any way. But the citation superscript for Sarah’s birth certificate is not associated with any fact because there is no fact for it to be associated with. Instead, the citation superscript is just floating in mid-air, not connected to anything.

For those reasons, I think that all of genealogy should have a standard “spouse/partner/or something” fact that can be associated with any pair of people who at some point become married or partnered or who become biological parents of the same child. And I think all of genealogy should have a standard Parents fact which actually is linked to a person and their parents and can provide a place to link a citation for evidence of parentage such as a birth certificate.