Using the rm database in EFCore, RMNOCASE

Has anyone worked with the RM database in EntityFrameworkCore? Scaffolding the model from the database fails due to the unknown RMNOCASE collation. Also, when I build the model manually, I can’t save data for the same reason.
Any pointers to a solution are very wecome.

The proprietary RMNOCASE collation used in RootsMagic sqlite has been a tripping factor for a long time for users of non-RM connections to RM databases. I’ve no knowledge of the system you are trying but have addressed workarounds with other sqlite managers ranging from explicit overrides with the standard NOCASE collation to extensions for fake RMNOCASE collations. For more info on those, see
https://sqlitetoolsforrootsmagic.com/rmnocase-faking-it-in-sqlite-expert-command-line-shell-et-al/

1 Like

see also-

Genealogy-scripts/Notes on collation RMNOCASE.md at main · ricko2001/Genealogy-scripts · GitHub

Most of my info came from sqlitetoolsforrootsmagic.com

2 Likes

Thanks a lot, Tom and Richard! This is great info. I’m going to try to use the fake RMNOCASE extension within the dotNet framework. My goal is to a) import citations/sources from a research tool I’m working on and b) generate a gedcom file to export my tree to ancestry.com.

Great. Be aware that RM has the ability to interact with the Ancestry and FamilySearch API and can transfer data directly without a GEDCOM file.

Of course, it may not be what you are looking for.

I did C# for many years, and just recently started work with Python.
Here is my site-
https://richardotter.github.com

After a lot of googling, I figured out how to use the unifuzz64.dll in EntityFramwork. This also enabled the scaffolding/reengineering features that generated a nice and complete set of C# model classes for interfacing with the database.Unfortunately the RM database design is a bit unusual so the automatic linking of the tables via their foreign keys does not work. But this is only a minor inconvenience.

I’m now able to read/write to the tables.So far there are no database inconsistencies after changing data. Next step is to programatically generate a sourc/citation and link it to an event. (sqlitetoolsforrootsmagic.com is a great ressource for understanding the database structure indeed.)

Be aware that RM has the ability to interact with the Ancestry and FamilySearch API and can transfer data directly without a GEDCOM file.

I only found a posiblity to read the tree from Ancestry into RM and then sync single items with TreeShare. I also found no way to sync changed sources/citations which is my main use case.

I’ll have a look at your GitHub later.

You can fake RMNOCASE directly in EF Core. In fact, with a tad bit of work you can create an empty RM database using the database first capabilities in EFCore.

        public RootsMagic8DbContext(string databasePath)
        {
            var connectionString = $"Data Source={databasePath}";

            this.connection = new Microsoft.Data.Sqlite.SqliteConnection(connectionString);
            this.connection.CreateCollation("RMNOCASE", (s1, s2) => string.Compare(s1, s2, StringComparison.OrdinalIgnoreCase));
        }

Now I cheated on scaffolding the models. I used SQLite Studio with the extension to generate a full schema in SQL, removed all the COLLATING RMNOCASE and created a database from that, then scaffolded from there.

You’ll want to save both copies of the SQL so you can find all the fields using RMNOCASE and manually adding them back in like this:

            entity.Property(e => e.Surname)
                .UseCollation("RMNOCASE");

To create the database using DB first you’ll need to add a default entry in the ConfigTable that has the options. It will look “tricky” since it is a blob and may not display nice in your SQLite client, but you’ll find it’s just an XML string you can serialize and deserialize. As an FYI, the Reports and Books are stored in the config table in the same format.

Thanks for describing your method.
I use the method described here: c# - How to load extension for SQLite (SQLite + Entity Framework Core) - Stack Overflow to load the unifuzz64.dll in the framework. I.e., I do:

public class SQLiteExtensionInterceptor : IDbConnectionInterceptor
{
    public DbConnection ConnectionCreated(ConnectionCreatedEventData eventData, DbConnection result)
    {
        var sqliteConnection = (SqliteConnection)result;
        sqliteConnection.EnableExtensions();
        sqliteConnection.LoadExtension(@"unifuzz64.dll");
        return sqliteConnection;
    }
}

To inject the interceptor I derive from the actual (scaffolded) context:

public class DB: rmContext
{
    public DB() : base(makeOptions())
    { }

    public static string sqLiteFile { get; set; } = string.Empty;

    static private  DbContextOptions<rmContext> makeOptions()
    {
        var opt = new DbContextOptionsBuilder<rmContext>();
        opt.AddInterceptors(new SQLiteExtensionInterceptor());
        opt.UseSqlite($"Data Source={sqLiteFile}");
        return opt.Options;
    }
}

So, I can simply do things like

 internal class Program
 {
     static void Main(string[] args)
     {
         DB.sqLiteFile = "test.rmtree";  // set the static database filename once (could also be set directly in the DB class of course)
       
         using (var db = new DB()) // connect to the database
         {
             foreach(var address in db.AddressTables)  // read out the Name column from the AddressTable
             {
                 Console.WriteLine(address.Name);  
             }                    
         }            
     }
 }

I need to find out if it is possible to change the model names in the scaffolder. E.g. Addresses instead of AddressTables would read much nicer. One can of course change them manually but the scaffolder would overwrite the new names if run again.

I’m using the method described because it’s cross-platform. For whatever reason loading the dylib doesn’t seems to work using the LoadExtention method on MacOS. Long term I want to be able the do the equivalent of a “drag and drop” to a new database without data loss.

Understood. I had a look at the source of unifuzz.c which is doing quite complicated unicode stuff which I don’t really understand. I wonder if your simple

 (s1, s2) => string.Compare(s1, s2, StringComparison.OrdinalIgnoreCase));

is taking care of all this? I assume that c#, being quite modern, handles unicode much better than plain c.
Does anyone have a list / database file of complicated unicode names/text on which both versions could be compared?

Neither option does exactly what RMNOCASE does, which is why you need to rebuild indexes within RM after doing updates to affected fields. My method is relatively simple and cross-platform, I got it from stackOverflow some years ago:

Under normal EF6 (prior to EF Core) the scaffolding ignored the RMNOCASE so I didn’t need to create the “clean” database to scaffold from, oh well.

In any case I know I’m not in the minority of technical users of RM who wish they’d just do away with RMNOCASE.

I find RM’s RMNOCASE collating sequence to be frustrating when it comes to writing SQLite queries against my RM database. But that’s not my biggest concern about RMNOCASE. My biggest concern is that I think RMNOCASE does a major disservice to RM users who need alphabets other than the English alphabet. Such users need sorting and searching to work natively as they would work in their own language. For example, Norwegian users need the letter Å to be sorted and searched as the 29-th letter of the alphabet rather than to be sorted and searched as an A which obviously is the first letter of the English (and Norwegian) alphabets. Well, I’m a little unclear about searching for the Norwegian Å, but RMNOCASE certainly seems to sort Å as if it were an A. As another example, there are problems in RM searching and sorting French diacritics, and I’m not even sure that’s so much an RMNOCASE problem as it is an underlying problem with UNICODE itself.

I have spent some considerable time through the years trying to develop a comprehensive understanding of sorting and searching various alphabets. I have not succeeded very well. I don’t think all the issues are all written down anywhere in one place. But here is a brief and surely very incomplete lists of some of the issues.

  • UNICODE is the code that has been developed to support all the alphabets in the world, including all manner of special symbols and icons. For example, it supports the Latin, Greek, Hebrew, and Arabic alphabets as well as the glyphic written languages such as Chinese and Japanese. It even supports Egyptian hieroglyphics and the icons that are so prevalent on smart phones such as the smiley.
  • UNICODE is a 21 bit code. That’s a very peculiar number. As a practical matter, that means that software that processes pure UNICODE needs to store it as 32 bit words. Since the programming language I speak the best these days is C++, that means using wchar variables rather than char variables, where wchar variables are 32 bits long and char variables are 8 bits long.
  • However, computer files and databases are almost never stored as pure UNICODE. Rather, they are stored as encoded strings such as UTF-8 or UTF-16, and UTF-8 seems to be more common.
  • Characters in SQLite databases (and hence in RM) are stored as UTF-8 strings. There are no other options in SQLite.
  • UTF-8 strings are stored as 8 bit bytes, usually called octets. Characters such as the English alphabet that can be stored as 8 bit bytes in ASCII or ANSII can also be stored as a single 8 bit byte in UTF-8. Other characters and other alphabets usually require more than one byte for each character in a UTF-8 encoding. Anything that can be stored in pure UNICODE’s 21 bits can also be stored in UTF-8’s 8 bits, That’s why UTF-8 is an encoding. You can almost think of UTF-8 as a data compression technique for storing UNICODE data.
  • I have never seen this written down explicitly, but my sense is that programs should process pure 21 bit UNICODE characters and not actually use UTF-8 encoding for processing. When reading a UTF-8 file, programs should convert the UTF-8 encoding to pure UNICODE for processing, and should convert the pure UNICODE back to UTF-8 for writing However, for files that contain only the old ASCII/ANSII characters, it is possible as a practical matter to process the UTF-8 encoded strings directly and bypass the steps of expanding them to pure UNICODE and then compressing them back down to UTF-8.
  • I am really fuzzy about how UNICODE intends to handle sorting and searching for things like diacritics in French (which don’t make a new letter) vs. letters such as Å in Norwegian which are their own separate letter. (By the way, Wikipedia says that the German Ü is a character but not a letter. But I remember being in West Berlin in the 1980’s and using a German keyboard that had Ü as a separate key. The distinction between a letter and a character can sometimes be very subtle.)
  • Relational databases such as SQLite support the definition of collating sequences for columns in a database. It seems natural to use the collating sequence of a database column as the tool for sorting and searching. But that would mean having database level collating sequences. And it would mean that any particular database would need to be language specific. That doesn’t seem like a good idea, for example, if a Canadian user wanted to print reports in either English or French from the same database and if they wanted the sorting to be correct for the language in the report.
  • My experience using SQLite queries for RM databases suggests that SQLite itself is expanding UTF-8 encoded strings to pure UNICODE in some cases, but in other cases SQLite is using the UTF-8 encoded strings “as is” without expanding them to pure UNICODE. For example, the SQLite “GROUP BY” function seems to operate on data as purely a string of bits without any consideration of what the string of bits means.

Well, I’m at or beyond my knowledge of pure UNICODE and UTF-8 encoding and how they are used by SQLite and by RM. I’m sure that not all of my observations above are completely correct. Feel free to correct anything you see that is wrong. My main conclusion is that UNICODE and UTF-8 and sorting and searching are really complicated, and I’m sure that I don’t understand it all. My secondary conclusion is that RM’s RMNOCASE collating sequence is not adequate for true multilingual support. Something much more comprehensive is really required.

1 Like

I wish I was able to add to or correct what you’ve said but, no, that merely expanded my limited understanding! A few years ago, I assembled a vast collection of Unicode ( or was it UTF-8 ?) characters in RM or sqlite3 to see how they were handled in RM. I came away quite confused…

If someone is interested, I meanwhile got my Entity Framwork ORM wrapper (rmSharp) going and pubished it at GitHub:

It still has some open ends but can already be used quite nicely. Here a quick example which selects all male persons which have more than one family and prints out the person, its families and its children. (The full code can be found in the Examples folder of the repository)

 static void Main(string[] args)
 {
     DB.sqLiteFile = "../../../../example_databases/US_Presidents.rmTree";  // database file to be set only once 

     using (var db = new DB()) // connect to the database
     {
         foreach (var person in db.Persons.Where(p => p.Families.Count > 1))  // persons who 'married' at least twice
         {
             Console.WriteLine($"{person.PrimaryName} ({person.PersonId})");
             foreach (var family in person.Families)
             {
                 Console.WriteLine($"  Family: {family}");
                 foreach (var child in family.Children)
                 {
                     Console.WriteLine($"    Child: {child}");
                 }
             }
             Console.WriteLine();
         }
     }
 }

which prints:

 ... 
 REAGAN Ronald Wilson (89)
   Family: REAGAN Ronald Wilson & ROBBINS Nancy_DAVIS (Anne Frances
     Child: REAGAN Patricia_Ann DAVIS (
     Child: REAGAN Ronald Prescott
   Family: REAGAN Ronald Wilson & FULKS Jane_WYMAN (Sarah Jane
     Child: REAGAN Maureen Elizabeth
     Child: REAGAN Michael Edward
     Child: REAGAN ? (Girl)

 LUCKETT Edith (113)
   Family: ROBBINS Kenneth & LUCKETT Edith
     Child: ROBBINS Nancy_DAVIS (Anne Frances
   Family: DAVIS Loyal & LUCKETT Edith

 DAVIS Loyal (118)
   Family: DAVIS Loyal & LUCKETT Edith
   Family: DAVIS Loyal &
     Child: DAVIS Richard
 ...

@keithcstone I’ll check if it is possible to use your collation method as an optional setting. Currently it uses unifuzz64.dll