Search independent of diacritical (accent) marks

I added all the diacritical marks to my French ancestors’ names. It was easy enough to do by switching to the U.S. International keyboard. Then I discovered that it makes searching more difficult. I’m an English speaker and I don’t always remember how a name is accented. Yet, in order to search for a name, I have to enter the accent marks or it doesn’t find the person. If the accent is near the end of the name, it’s not too bad, because I’ll see the name pop up when I start typing. But for an accent near the beginning of the name it’s kind of a pain.

I would like to suggest that name searches be diacritic-independent, as they are on FamilySearch and Ancestry.com.

I think this is a very difficult problem and I’m by no means sure I have a potential solution to suggest. It’s reasonable to consider that if FamilySearch and Ancestry can do it, then so can RM. But one language’s diacritical marks are another language’s distinct letters.

The Norwegian Å is not an A with a diacritical mark. It is a different letter of the Norwegian alphabet. The German Ü is not a U with a diacritical mark. It is a different letter of the German alphabet. The Spanish Ñ is not an N with a diacritical mark. It is a different letter of the Spanish alphabet. Users in those languages want their letters. They don’t just want English letters with diacritical marks.

On the other hand, my understanding is that the French and English alphabets are identical and that French simply uses diacritical marks rather than adding new letters to the alphabet. Well, sometimes English uses diacritical marks as well, but not nearly as frequently as French. I hope my understanding of French using diacritical marks rather than separate letters is correct.

I suspect but don’t know for sure that your French letters with diacritical marks are being stored almost as if they were actually separate letters rather than as being the same letter with a diacritical mark. Part of what makes this so difficult is that Norwegian users don’t want their Å being stored and searched as if it were an English A with diacritic marks. They want it stored and searched as if it were a Norwegian Å.

Also, sorting comes into play. If I understand correctly, French letters are supposed to be sorted as if the diacritical marks were not there. On the other hand, the Norwegian Å is supposed to be sorted at the end of the alphabet instead of at the front of the alphabet with the rest of the A’s.
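
To make the sorting difference concrete, here is a rough Python sketch of the two behaviors. It is deliberately simplified: real French and Norwegian collation rules have more to them, and the function names and the sample surnames are made up for illustration.

```python
import unicodedata

def french_like_key(name):
    # Simplified "ignore the marks" ordering: decompose each letter,
    # drop the combining marks, and compare the base letters only.
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

# Assumed alphabet order for the sketch: the three extra Norwegian
# letters come after z, with å last.
NORWEGIAN_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
RANK = {ch: i for i, ch in enumerate(NORWEGIAN_ALPHABET)}

def norwegian_like_key(name):
    # NFC keeps å as one code point so it can be ranked as its own letter.
    composed = unicodedata.normalize("NFC", name).casefold()
    # Characters outside the alphabet (spaces, hyphens) sort last here.
    return [RANK.get(ch, len(RANK)) for ch in composed]

surnames = ["Åberg", "Berg", "Andersen", "Zahl"]
marks_ignored = sorted(surnames, key=french_like_key)
letters_distinct = sorted(surnames, key=norwegian_like_key)
```

With the first key, Åberg sorts among the A’s; with the second, it moves to the end of the list. The same stored name needs a different position depending on which language’s rules apply.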

As I said, it’s a difficult problem and I don’t know the solution.

In place of the accented letter, use a wildcard. The underscore _ wildcard matches any single character. Example: SM_TH will return results for SMITH and SMYTH families.

The percent sign % wildcard matches any sequence of zero or more characters. Example: Be%t will return results for Bennet and Bentley.
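
For anyone curious how the underscore wildcard gets around an accent, here is a small sketch using Python’s built-in sqlite3 module. It assumes the search behaves like a plain SQL LIKE; the `people` table and `search` helper are invented for the example and are not RM’s actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (surname TEXT)")  # hypothetical table
conn.executemany(
    "INSERT INTO people (surname) VALUES (?)",
    [("Smith",), ("Smyth",), ("Néry",), ("Nery",), ("Bennet",)],
)

def search(pattern):
    # LIKE treats _ as exactly one character and % as zero or more.
    rows = conn.execute(
        "SELECT surname FROM people WHERE surname LIKE ?", (pattern,)
    )
    return [row[0] for row in rows]

smith_matches = search("SM_TH")  # _ stands in for the i or the y
nery_matches = search("N_ry")    # _ stands in for e or é
```

The underscore matches the accented é as a single character, so you don’t need to remember which accent the name carries, only where it sits.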

You may want to consider adding alternate names without the diacritical marks.
They will be in the index and point to the correct person.

It’s a lot of work to do at one time. I use Renee’s method myself.
It also duplicates something RM has already done: since v9 was released, the database has had new fields that contain the name minus diacritical marks, but those fields aren’t available in the app yet. It shows it’s something they’re thinking about.
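
I don’t know how RM fills those fields, but a common way to produce a “name minus diacritical marks” value is Unicode decomposition. A minimal Python sketch, with a made-up `strip_diacritics` helper and sample names:

```python
import unicodedata

def strip_diacritics(text):
    # NFD splits "é" into "e" plus a combining accent; dropping the
    # combining characters leaves only the base letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

names = ["Hélène Néry", "François Côté"]

# Map the accent-free, case-folded form back to the name as entered.
search_keys = {strip_diacritics(name).casefold(): name for name in names}
found = search_keys.get("helene nery")
```

Letters that have no decomposition, such as Ø, pass through unchanged, so this only covers marks that Unicode treats as separable from the base letter.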

Hi Richard and Renee, OK, thanks, I’ll do the wildcard search for now and hope for a future update that takes care of this.

Wouldn’t it just be a case of indexing it with a different collation?

That’s a very interesting question, and I’m not 100% sure of the answer. But I believe the following is correct.

  • RM uses SQLite as its database engine. (This one is 100% certain.)
  • A column in an SQLite database can only have one collation.
  • The index for a column in an SQLite database uses the same collation as the column itself.
  • Therefore, the only solution for RM would be to duplicate the data from one column into another column, and then give the two columns different collations.

A column can have only one default collation: the one given in the CREATE TABLE DDL. I think one can then create as many indexes as desired, each with its own collation.
See the SQLite CREATE INDEX documentation, sec. 1.5.

But I don’t know how you would specify the index in a select.
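
As far as I understand SQLite, you don’t name the index in the SELECT at all. You put COLLATE on the comparison, and the query planner can use an index whose collation matches. Here is a sketch with Python’s sqlite3 module; the NOACCENT collation, the `people` table, and the index name are all invented for the example.

```python
import sqlite3
import unicodedata

def fold(text):
    # Drop combining marks, then fold case.
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

def noaccent_collation(a, b):
    # Collation callbacks return a negative, zero, or positive number.
    fa, fb = fold(a), fold(b)
    return (fa > fb) - (fa < fb)

conn = sqlite3.connect(":memory:")
conn.create_collation("NOACCENT", noaccent_collation)
conn.execute("CREATE TABLE people (surname TEXT)")  # default collation: BINARY
conn.execute(
    "CREATE INDEX idx_surname_noaccent ON people (surname COLLATE NOACCENT)"
)
conn.executemany(
    "INSERT INTO people (surname) VALUES (?)",
    [("Nery",), ("NERY",), ("Néry",), ("Noiret",)],
)

query = "SELECT surname FROM people WHERE surname = ? COLLATE NOACCENT"
matches = sorted(row[0] for row in conn.execute(query, ("nery",)))

# The plan text should show whether idx_surname_noaccent was chosen.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, ("nery",)).fetchall()
```

One catch with a custom collation: every connection that opens the database has to register it before touching that index, which matters if people open the file with outside tools.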

If you didn’t want to duplicate data into another column you’d pick just one collation. If you used a case- and accent-insensitive collation, then the search “nery” would match “Nery”, “NERY”, “Néry”, etc.

If you needed to search for one of those specifically, you’d need to do the broad insensitive search first and then filter the results in business logic.
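
A rough Python sketch of that two-step approach, with the database search replaced by a plain list for brevity. The helper names are hypothetical.

```python
import unicodedata

def fold(text):
    # Accent- and case-insensitive form used for the broad match.
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

def broad_search(names, query):
    # Step 1: stands in for the insensitive database search.
    key = fold(query)
    return [name for name in names if fold(name) == key]

def exact_search(names, query):
    # Step 2: narrow the broad result to the exact spelling requested.
    # NFC normalization makes "é" compare equal however it was typed.
    wanted = unicodedata.normalize("NFC", query)
    return [
        name
        for name in broad_search(names, query)
        if unicodedata.normalize("NFC", name) == wanted
    ]

surnames = ["Nery", "NERY", "Néry", "Noiret"]
broad = broad_search(surnames, "nery")
exact = exact_search(surnames, "Néry")
```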