Apparent issue with interrupted AutoMatch operations on large databases

I think there might be an issue with the implementation of the FamilySearch AutoMatch algorithm that leads to large databases ending up in a pathological state. Here’s what I observe:

I work on databases whose size ranges from a hundred people to 600,000. Typically I see maybe 80% of people having viable FamilySearch matches.

Imagine that I import a 10,000-person tree from Ancestry that has never been automatched. I run AutoMatch and leave it running whilst I watch TV. Four hours later I return and see from the progress bar that it has completed half of the database and matched 4,000 people. I want to go to bed: “No problem, I’ll restart it tomorrow morning.” The next day I open RM8 and note that I have a 10,000-person database with 4,000 people that have been matched. So I click the AutoMatch button and go do some housework. I walk by the computer an hour later and see that it has completed about 1/4 of the database and hasn’t matched anyone.
“Why is this?”
Because it’s trying to match the unmatched people in the same order as last time, so the first people being tried are the 1,000 people who were tried yesterday and couldn’t be matched. That’s an hour of wasted time before the real AutoMatching begins.
“So don’t interrupt AutoMatch operations,” you say; let it run until it completes.

But what happens when your database has 600,000 people, you were able to run AutoMatch for four days, and you matched 175,000 people?
If you want to restart AutoMatch now, you might leave it running for 16 hours before you match a single person. That’s what I mean by a pathological state.
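(Rough arithmetic, using the ~80% match rate I mentioned above: 175,000 matches implies roughly 220,000 people attempted, of whom about 45,000 were attempted but not matched. At the same pace, roughly 2,300 attempts per hour over those four days, a restart would spend on the order of 16-20 hours re-trying those 45,000 before it reached anyone new.)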

If the implementation remembered where the last AutoMatch operation stopped and continued from that point, it could avoid this issue.
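A rough sketch of the idea, expressed as outboard SQLite against the database file. PersonTable, FamilySearchTable and the PersonID column are my assumptions about RM’s schema, and the checkpoint table is entirely hypothetical, not something RM actually has:

```sql
-- Hypothetical sketch: persist where the last AutoMatch pass stopped,
-- then resume from the next RIN instead of re-trying everyone from RIN 1.
-- Assumes PersonTable.PersonID is the RIN and FamilySearchTable.PersonID
-- marks people already matched; verify against your own file first.
CREATE TABLE IF NOT EXISTS AutoMatchCheckpoint (LastPersonID INTEGER);

-- Candidates for the next run: unmatched people with a RIN above the checkpoint.
SELECT p.PersonID
FROM   PersonTable AS p
WHERE  p.PersonID > COALESCE((SELECT MAX(LastPersonID) FROM AutoMatchCheckpoint), 0)
  AND  p.PersonID NOT IN (SELECT PersonID FROM FamilySearchTable)
ORDER  BY p.PersonID;

-- After the run, record the highest RIN actually attempted (value is illustrative).
INSERT INTO AutoMatchCheckpoint (LastPersonID) VALUES (123456);
```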

What do you think?


So, where do you even start to make sense out of a db with 600,000 people? What type of research are you doing? Just curious

So that tree is being used to try to make sense of DNA clusters. It’s an experimental fishing tree, combining over 600 different people’s trees (all DNA matches), looking for common (distant) ancestors; the explicit focus is quantity, not quality.

For example, I have an unlinked cluster of over 40 DNA matches who overlap on the same segment of DNA. With this tree I saw that 30 of these are descendants of a Thomas Livesay, born in 1700 in Pleasington, Lancashire. It happens that 3 of my 32 3rd-great-grandparents were born within 3 miles of Pleasington, so I am theorizing that one of these 3 ancestors might be a descendant of Thomas Livesay.

When I look across the four DNA websites where I have uploaded my data I have over 30 such clusters. I am focussing on 16 clusters that contain more than a dozen matches. Thus far, I have solved (linked) four of these clusters. With two of them I identified the biological father of a 2nd ggf and a 3rd ggm, both of whom were NPEs - born “out of wedlock.” My goal is to solve another three of the remaining dozen clusters in the next two years.

Does this make sense?

Peter

Note that my academic background is in applied geophysics. This is the science of using physics to measure things at the earth’s surface, from the sky, or underwater, in order to make inferences about what is happening under the ground. These techniques take very large numbers of high-noise, low-precision measurements to search for gold, minerals, oil, even buried treasure or human bodies.

Yes, it will look for those not matched previously. That’s by design: it has no idea whether someone has added an exact match on Family Tree since the last time it looked for them. The better workflow would be to take that 10,000-person Ancestry tree and run AutoMatch on it before you merge it into the 600,000-person database.

That’s my new workflow. But it still leaves me with the question of what to do with the 600,000-person tree that I have. There isn’t an obvious way to tear it into pieces without losing the connections between people.

Is there a way in RootsMagic to reorder a database?

Not within the RM UI. Possible, but very challenging, with outboard SQLite queries.

Rather than renumber everyone, I wonder if offsetting those matched to the highest numbers would suffice.

Wow, that is a lot of data mining, but you are experiencing some success.

I wonder if you would have greater progress in RM7. RM8 has (or had) a memory management issue which progressively consumes RAM up to the 2GB limit for 32-bit apps set by Windows. You can convert your RM8 database to a compatible RM7 database with some (probably not critical) losses: Direct Import of RM8 database into RM7 – Part 2 – SQLite Tools for RootsMagic

I wonder if you could export a subset of people to GEDCOM and import it into a new RM database without losing any links to FSFT or Ancestry. Run the FS AutoMatch. Then use SQLite to import the matches from the subset.FamilySearchTable into the main.FamilySearchTable. While the subset may have lost some family linkages, they will remain in Main. Of course, their absence may affect the AutoMatch success rate, but perhaps that’s a minor collateral impediment to achieving your objective.
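If the RINs survive the GEDCOM round trip unchanged, the SQLite step might look something like the sketch below. The table and column names are assumptions to check against your own files, the file name is illustrative, and outboard access to RM files generally needs the RMNOCASE collation registered (or faked) first:

```sql
-- Hypothetical sketch: carry AutoMatch results from the subset database back
-- into the main database. Back up both files before trying anything like this.
ATTACH DATABASE 'subset.rmtree' AS subset;

-- Copy rows for people not already matched in main. Assumes the two databases
-- share PersonID values and an identical FamilySearchTable layout; if the table
-- has its own autoincrement key, list the non-key columns explicitly instead of *.
INSERT INTO main.FamilySearchTable
SELECT s.*
FROM   subset.FamilySearchTable AS s
WHERE  s.PersonID NOT IN (SELECT PersonID FROM main.FamilySearchTable);

DETACH DATABASE subset;
```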

And another thought is alternative software… RM is not the only one to automatch with FSFT, is it?

Windows has a 2GB memory limit for 32 bit apps? What are the limits for 64 bit apps and why would someone still be using 32 bit? Would not some heavy database and spreadsheet applications reach this memory usage with large business data sets?

On my Mac, RM8 is a 64-bit program, albeit still Intel and requiring Rosetta translation on Apple Silicon. RM8 at idle uses 20% CPU and about 330 MB of RAM, which is extremely high CPU and minor RAM usage.

| Operating System        | Maximum Memory (RAM) |
|--------------------------|----------------------|
| Mac OS X from 10.10      | 18 Exabytes          |
| Windows 10 Home 32-bit   | 4 GB                 |
| Windows 10 Home 64-bit   | 128 GB               |


The reason Tom stated 2GB has to do with user address space (2GB) and kernel address space (2GB). The kernel address space is for operating system bits, such as drivers and the like. The user process, RM, can’t directly access the kernel address space, hence it is limited to about 2GB.


I think you’re right.
In fact I’ve made the same observation, but you were optimistic enough to report it.

On Mac OS I started with a separate Mac OS account for RM. When I experienced extended run times I created a second Mac OS account and copied the RM DB file(s) to the second account. That allows a second instance of RM.
I started a duplicate name check earlier today (120,000 people in the DB). I can monitor RootsMagic from other Mac OS accounts using Activity Monitor. At this point it has used 5.5 hours of CPU but is only using about 120% CPU (of 1200%; the CPU is an i7 with 6 cores and 12 threads).
I don’t know if you can duplicate this on Windows, but it is working well for me on Mac.

I agree RM needs a way to PAUSE and RESUME some of the Family Search features.

What a brilliant idea, and one that was not obvious to me. Thank you for sharing this. There are many long-running RM operations that I do, and often put off doing, because I don’t want to stop working in RM whilst waiting for them to complete.

Thanks,

Peter

This is certainly an interesting idea, but it brings up the question of multiple instances of RM accessing the same database file.
I’ve done it and never had any corruption issues, as I believe SQLite is designed for it, but RM is not designed for it. When running a second instance of RM, one will often get the “database is locked” message in a dialog that must be manually dismissed. And I don’t think that RM retries the operation, potentially causing orphan records (because SQL transactions are not being used).
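For what it’s worth, when the second access is your own outboard SQLite session rather than a second copy of RM, you can at least make your side wait for locks to clear rather than fail immediately. A small sketch (the timeout value is arbitrary):

```sql
-- Ask this SQLite session to wait up to 5 seconds for a lock to clear
-- instead of immediately returning "database is locked".
PRAGMA busy_timeout = 5000;

-- Wrapping multi-statement changes in a transaction means a failure part-way
-- through rolls back cleanly rather than leaving orphan records behind.
BEGIN IMMEDIATE;
-- ... updates here ...
COMMIT;
```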

I think that AutoMatch works through the people in the database in RIN order. As you say, this means that every time you start it, it goes through the same people it has already been through before, in the same order.

In another thread, I suggested that it should still work through the database in RIN order, but from highest to lowest. This would automatically mean that it starts with the people it has not looked at before, which is where the biggest payoff is.
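Purely as an illustration of what “highest to lowest” would mean, here is a sketch in outboard SQLite, assuming PersonTable.PersonID is the RIN and FamilySearchTable records existing matches by PersonID (worth checking against your own file):

```sql
-- Unmatched people, highest RIN first, so the newest additions (never yet
-- attempted) would be tried before the long-standing failures.
SELECT p.PersonID
FROM   PersonTable AS p
LEFT   JOIN FamilySearchTable AS f ON f.PersonID = p.PersonID
WHERE  f.PersonID IS NULL
ORDER  BY p.PersonID DESC;
```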

This seemed to me to be a very useful suggestion for those with large databases, and probably not the most difficult to achieve. Unfortunately no-one else picked it up.

For a large database, might there not be much the same problem with some fraction never reached by auto-match? That might be mitigated by giving the user the choice of ascending/descending RIN order for auto-match or, better yet, auto-match for groups. Does the latter already exist?

@pbooth99 , how are you doing with auto-match now?