Duplicate Citations vs Tags or "Uses"

Due to improper use/behavior of TreeShare, “Anna Powers” in my tree has 257 Person Citations. Many, many are duplicates of course. Hoping to eliminate the duplicates, I did a “Merge Duplicate Citations” on the database and my total number of citations went from 17,294 to 2,676! (This was the first time I dared do this operation.)
However, when I look at Anna Powers’ Person Citation list I still see 257 entries - all the duplicates are still there! I exited the program, restarted, even ran the database tools: still 257 citations. On the other hand, I did a Narrative Report on just Anna and I see only about 40 separate endnotes. And no duplicates! Perfect!
My conclusion is that the Citation list accessible from the Edit Person screen is really a “Citation Tag” list. In other words, if the same citation is tagged 5 times to Anna Powers’ person, it will appear in her Person Citation list 5 separate times.
I guess this is mostly a nuisance but I would like to be able to eliminate all those extraneous citation tags.
Am I getting this right?

Sounds right. RM should have a tool to delete duplicate tags. It wouldn’t be hard to add.

The way citations and duplicates thereof work in RM8 and RM9 can be a little complicated to explain. Let’s try it this way.

Suppose you had a perfect RM database with no duplicate citations and no other errors whatsoever. Then suppose you entered a new person from an obituary and that you entered the obituary as a citation. Next, suppose you Memorized and Pasted that same obituary 17 different times to 17 different people who were mentioned in the obituary. Finally, suppose that each of those 17 times you pasted, you used the Paste with Reuse option.

Well, how many copies of that citation are there in your database? Since the citation was in the database 1 time to start with and then you Memorized and Pasted it 17 times, you might think there are now 18 copies of that citation in your database. But you would be wrong. There is only 1 copy of the citation in your database and that 1 citation is linked to 18 different places in your database. If you find a typographical error in that citation and change it in any 1 of the 18 places, you will see the error corrected in all 18 of the places. That’s because all 18 places where the citation is used are really reusing the same citation. All 18 places have their own link to it, so there is 1 citation with 18 links.

Now let’s go through the same scenario again except that when you Memorize and Paste the citation you use the Paste with Copy option each time you paste. This time, how many copies of the citation are in your database? Well, there are 18 citations and each citation only has 1 link. In the first case, there was 1 citation and it has 18 links.
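To make the Reuse versus Copy difference concrete, here is a minimal Python sketch. It is only an in-memory model, not RootsMagic's actual storage; the `Citation` class and the helper names `paste_with_reuse` and `paste_with_copy` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Citation:
    text: str


def paste_with_reuse(citation, places):
    # Every place gets a link to the very same Citation object.
    return {place: citation for place in places}


def paste_with_copy(citation, places):
    # Every place gets its own independent copy of the citation.
    return {place: Citation(citation.text) for place in places}


def distinct_citations(links):
    # Count distinct citation objects, not links.
    return len({id(c) for c in links.values()})


others = [f"person_{n}" for n in range(2, 19)]  # the 17 other people

first = Citation("Obituray of J. Doe")  # typo on purpose
reused = {"person_1": first, **paste_with_reuse(first, others)}

second = Citation("Obituray of J. Doe")
copied = {"person_1": second, **paste_with_copy(second, others)}

# Fix the typo in one place only.
first.text = "Obituary of J. Doe"
second.text = "Obituary of J. Doe"

reuse_fixed = sum(c.text.startswith("Obituary") for c in reused.values())
copy_fixed = sum(c.text.startswith("Obituary") for c in copied.values())

print(distinct_citations(reused), len(reused), reuse_fixed)  # 1 18 18
print(distinct_citations(copied), len(copied), copy_fixed)   # 18 18 1
```

With Reuse the one correction shows up in all 18 places; with Copy it only fixes the one copy you edited.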

What the Merge All Duplicate Citations tool really does is to delete all but one of the duplicate citations, and then it makes a link from the remaining citation to all the places all the duplicate citations were linked. It may sound scary, but nothing is lost.
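The merge just described can be sketched roughly like this. It assumes citations are compared by their text only and that the lowest ID survives; both are assumptions for the sake of the example, and `merge_duplicate_citations` is a hypothetical function, not anything inside RM.

```python
def merge_duplicate_citations(citations, links):
    """citations: dict of citation_id -> text.
    links: list of (place, citation_id) pairs.
    Returns the surviving citations and the relinked list."""
    keeper_for_text = {}
    remap = {}
    for cid in sorted(citations):
        # The first ID seen for a given text becomes the keeper.
        keeper = keeper_for_text.setdefault(citations[cid], cid)
        remap[cid] = keeper
    kept = {cid: text for cid, text in citations.items() if remap[cid] == cid}
    # Every old link is kept, but now points at the keeper.
    new_links = [(place, remap[cid]) for place, cid in links]
    return kept, new_links


citations = {1: "Obit", 2: "Obit", 3: "Obit", 4: "Census"}
links = [("Anna", 1), ("Anna", 2), ("Ben", 3), ("Ben", 4)]

kept, new_links = merge_duplicate_citations(citations, links)
print(kept)       # {1: 'Obit', 4: 'Census'}
print(new_links)  # [('Anna', 1), ('Anna', 1), ('Ben', 1), ('Ben', 4)]
```

Notice that the number of links does not change. In this sketch Anna ends up with two links to the same citation, so a list built from links would still show two entries for her.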

Your situation is more complicated because the citations all came into RM from Ancestry. So you didn’t do any Memorizing and Pasting of citations. In a perfect world, all of what appear to be duplicate citations should have been merged and all would be well. The fact that the duplicate citations didn’t all merge suggests a couple of possibilities. One possibility is that some of the duplicate citations are actually associated with different Master Sources. In that case, it’s worthwhile to run the Merge All Duplicate Sources tool before you run the Merge All Duplicate Citations tool.

The other possibility is that some of your duplicate sources or duplicate citations are not exactly duplicate. For example, they could include text that differs only by a blank or a comma. It’s hard to believe that all 257 duplicate citations are different in this manner, because each one would have to be different from all of the other 256 ones.

If all else fails, there is a manual merge. But that could be very tedious because you can only manually merge two citations at a time. So before doing that, can you look at some of those 257 duplicate citations for Anna Powers very carefully to see if you can see any differences at all?

There is one other possibility that I’m reluctant to bring up. Before I get into all those gory details, can you share the name of the Ancestry collection that all those 257 duplicate citations are coming from?

Ohhh, I know where you’re going. Check this @DaveW : how many media are tagged to your Citation that has 257 uses?

Actually, I’ve changed my mind just a little bit about where I am going.

Where I originally was going was that there are Ancestry collections where the citations that come into RM via TreeShare differ only in the media file that comes into RM. Two totally different images that are really two totally different citations can come into RM with the same string of characters as the citation field. In such cases, RM8’s Merge All Duplicate Citations tool would ignore the media file difference and would therefore merge citations that were not really duplicate. So let us suppose that RM9 has solved that little problem. It would be an easy enough test to run, but I don’t have time to run the test tonight.

If RM9 has solved that problem, then there is another problem that has nothing to do with Ancestry’s collections where the citations that come into RM via TreeShare differ only in the legitimate difference in the media file. Namely, two citations can come into RM via TreeShare from most any collection that really are duplicate citations and where the media file really is the same file with the same name. Except that the “main” file name may be the same and some additional text may have been added to the file name in parentheses - sort of like a version number.

I don’t know if this “version number” thing happens in RM or in Windows or somewhere else. The same thing can happen if I download the same file from the Internet twice with a browser, like if I download a PDF that is a bank statement. The second download sometimes doesn’t overwrite the first download but rather uses the “same” file name with an added “version” number. Several RM users have reported this issue with files downloaded from Ancestry via TreeShare, and I don’t think any special Ancestry collection is required to create the problem.

So my new ask of the original poster is to look at the media file names for all those 257 duplicate citations to see if they have the “same” file name except for the added “version number” in parentheses.

I suspect that these are 2 versions of the same use case… the first is 2 or more different image files attached and the 2nd is the same image with different file names. As you noted, when the same filename is downloaded via treeshare the versioning that occurs is to repeat the file id within parentheses. My personal best (thus far) is to have downloaded the same image 3 times, leading to a file name of “unique_id(unique_id)(unique_id).jpg”. It’s a mouthful!
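For anyone who wants to hunt for these "versioned" names in a media folder listing, here is a rough Python sketch. The pattern is only my guess from the examples in this thread (the file id repeated in parentheses before the extension, with an optional dot in front), and `base_name` and `group_versions` are hypothetical helpers. A matching name does not prove the image content is identical, so treat the groups as candidates to check by eye.

```python
import re
from collections import defaultdict

# stem, optional dot, one or more "(...)" groups, then the extension
_VERSIONED = re.compile(
    r"^(?P<stem>[^()]+?)\.?(?P<suffixes>(?:\([^()]*\))+)(?P<ext>\.[^.()]+)$"
)


def base_name(filename):
    """Strip repeated "(file id)" suffixes; leave other names alone."""
    m = _VERSIONED.match(filename)
    if not m:
        return filename
    stem = m.group("stem")
    repeats = re.findall(r"\(([^()]*)\)", m.group("suffixes"))
    # Only treat it as a version if every group repeats the stem.
    if all(r == stem for r in repeats):
        return stem + m.group("ext")
    return filename


def group_versions(filenames):
    """Group file names that share the same base name."""
    groups = defaultdict(list)
    for name in filenames:
        groups[base_name(name)].append(name)
    return dict(groups)


names = ["a_1.jpg", "a_1(a_1).jpg", "a_1(a_1)(a_1).jpg", "b_2.jpg"]
for base, versions in group_versions(names).items():
    if len(versions) > 1:
        print(base, "->", versions)
```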

I use treeshare mainly because I want source data that I found on ancestry.com to appear as “Ancestry” sources and not “Other” sources when I push changes back up. (I don’t run a lot of reports so I don’t (yet) encounter the issues that using the Ancestry Record source template creates.) That said, the source data that gets downloaded is almost always incomplete and is inconsistent across ancestry source collections. So, every source/citation downloaded via treeshare needs some amount of manual editing after the download in order to have a complete record. That’s true without even considering where a citation transcription should reside. Editing the record in RM avoids this “auto merge” issue. I suspect that’s the only real “fix” to this problem.

My apologies for the delayed reply. It appears that the global source and citation merge processes did two things:

  1. they merged all “identical” sources/citations without considering the tagged media. Unfortunately, I don’t know what this citation looked like before merging but I greatly doubt it had 91 media and 654 uses or tags. This probably would not have happened if I had uniquely named each citation but I naively TreeShared them over from Ancestry and their only distinguishing feature was their unique media.

  2. if there were, say, 8 duplicate citations in a Person list prior to merging, the merge process failed to merge the associated tags or “uses”. It just left behind 8 duplicate tags making it appear that the duplicate citations were not merged. Essentially the Person Citation List is now really a Person citation tag or use list.
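To illustrate the second point, removing the leftover duplicate tags amounts to dropping repeated (owner, citation) pairs. The sketch below is only an illustration on plain Python data; `dedupe_tags` is a hypothetical helper and no such tool exists in RM as far as this thread knows.

```python
def dedupe_tags(links):
    """Keep the first occurrence of each (owner, citation_id) pair."""
    seen = set()
    result = []
    for link in links:
        if link not in seen:
            seen.add(link)
            result.append(link)
    return result


# After a merge, Anna has the same citation tagged three times.
links = [("Anna", 1), ("Anna", 1), ("Anna", 1), ("Anna", 4), ("Ben", 1)]
print(dedupe_tags(links))  # [('Anna', 1), ('Anna', 4), ('Ben', 1)]
```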

I see this as a serious flaw in the RootsMagic program. While TreeSharing is encouraged there is no warning of this effect of the merge processes. IMO the merge processes should consider tagged media as a distinguishing feature of a citation.

As a result, I will probably restore a recent (unmerged) backup. Lesson learned.

Thank you Jerry and Tom for your time analyzing this issue. It looks like Jerry correctly identified the problem - identical citations with different media.

I’m back on the computer. I have confirmed what you already discovered. This bug is not fixed in RM9.

Unfortunately, the New York County Marriage Records collection is one of those in Ancestry where TreeShare working in tandem with Ancestry produces citations in RM which are identical except for the media file, which is different. I suspect the bug that creates citations identical except for the media file is really in Ancestry rather than in TreeShare, but it’s hard to know for sure.

Then there is the second bug which is in RM8 and which was not fixed in RM9. Namely the Merge All Duplicate Citations tool does not take differences in the media file into account when determining if citations are duplicate or not. It really must do so because of the way citations can come into RM from TreeShare.
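Here is a small Python sketch of what taking the media file into account would mean. The dictionary fields (`source`, `text`, `media`) and the function names are made up for the example; this is not RM's real comparison logic, which I have not seen.

```python
def duplicate_key(citation, include_media):
    """Build the key used to decide whether two citations are duplicates."""
    key = (citation["source"], citation["text"])
    if include_media:
        # A frozenset ignores the order in which media were attached.
        key += (frozenset(citation["media"]),)
    return key


def count_after_merge(citations, include_media):
    """How many citations would survive a merge under the chosen rule."""
    return len({duplicate_key(c, include_media) for c in citations})


citations = [
    {"source": "NY Marriages", "text": "Same text", "media": ["img_001.jpg"]},
    {"source": "NY Marriages", "text": "Same text", "media": ["img_002.jpg"]},
    {"source": "NY Marriages", "text": "Same text", "media": ["img_002.jpg"]},
]

print(count_after_merge(citations, include_media=False))  # 1
print(count_after_merge(citations, include_media=True))   # 2
```

Ignoring media collapses all three into one citation, even though two different images are involved. Including media keeps the two genuinely different citations apart while still merging the true duplicate.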

In your case, you are not presently able to use the Merge All Duplicate Citations tool. But in general, it’s extremely important that this tool be reliable enough to be used. That’s because without it the Reuse Duplicate Endnote Numbers option will not work for printed reports. This problem is new in RM8 and thus far continues into RM9. The reason RM7 didn’t have the problem is that it determined duplicate endnotes based on the text in the endnote. But RM8 and (thus far) RM9 determine duplicate endnotes based on whether the same entry in the CitationTable is being used. That in turn means that the Merge All Duplicate Citations tool must be run, which you can’t do because your citations really are different but differ only in the media file.
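The difference between the two ways of numbering endnotes can be sketched in a few lines of Python. This is only a model of the behavior as I described it, with a hypothetical `number_endnotes` function and made-up field names; it is not code from either version of RM.

```python
def number_endnotes(uses, key):
    """Assign endnote numbers, reusing a number whenever key(use) repeats."""
    numbers = {}
    assigned = []
    for use in uses:
        k = key(use)
        if k not in numbers:
            numbers[k] = len(numbers) + 1
        assigned.append(numbers[k])
    return assigned


# Citations 1 and 2 are unmerged duplicates with identical text.
uses = [
    {"citation_id": 1, "text": "Obituary of J. Doe"},
    {"citation_id": 2, "text": "Obituary of J. Doe"},
    {"citation_id": 1, "text": "Obituary of J. Doe"},
]

by_text = number_endnotes(uses, key=lambda u: u["text"])        # RM7-style
by_id = number_endnotes(uses, key=lambda u: u["citation_id"])   # RM8/RM9-style

print(by_text)  # [1, 1, 1]
print(by_id)    # [1, 2, 1]
```

Matching on text gives one endnote. Matching on the citation entry gives two until the duplicates are merged.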

The other little wrinkle that I mentioned is still in the picture, but apparently it is not your problem. Namely, the same media file can be downloaded multiple times by TreeShare with the resultant media file name differing only in its versioning. I have no idea what to do about that problem. I have not been able to recreate it myself. But enough different RM users have reported that problem that I’m persuaded it’s a real problem that needs to be addressed somehow or other.

@DaveW, you had the same problem a year ago which you correctly analysed then.

It’s a shame RM has done nothing to mitigate the risk. I’ll bet there are dozens, if not hundreds of users who merrily clicked on that control and have yet to discover the mess it has made with citations from certain Ancestry Record Collections.

Indeed I did Tom. Deja vu all over again…

Agreed Jerry. There are serious shortcomings with Merge All Citations but it does seem to clean up the Endnotes in reports. As I mentioned, in this example, even though 257 citations or citation tags remained in the Citation Lists after merging, the Narrative Report showed only about 40 endnotes. (However, one of those remaining endnotes will point to a citation with 92 media tags.)

Also, I do have many duplicate media that differ only in their versioning: e.g. xxx_xxx.jpg and xxx_xxx.(xxx_xxx).jpg. Apparently, RM cannot discern identical media and thus creates duplicate copies. And more troubling is that these duplicates often have different tag sets, making it difficult to delete the duplicates. But I do see that as a separate problem.

I had that issue, I think in January of 2021, in RM 8. I had imported from FTM to RM 8 and thousands of my citations from Ancestry either didn’t have information in the citation field, or had dupe information, and I discovered that RM 8 does not take into account the media or webtags that are attached in determining what is a dupe. I brought this to RM support’s attention but they didn’t really seem interested. Eventually I started the import over again but did not dedupe the citations. Instead I spent about a month going through and manually making sure the proper source and citation information was in the right fields. Then when I ran the citation dedupe it was all good. But yes, it can cause major problems, especially if the person doesn’t have a good backup first.

Same experience. I do not use the Merge All Citations feature since it may merge media with unrelated tags. I only merge citations manually. This feature needs to be redesigned/debugged.