neisbut / Npgsql.Bulk

Helper for performing COPY (bulk insert and update) operations easily, using Entity Framework + Npgsql.


OutOfMemoryException when using ImportAsync

creste opened this issue · comments

I am trying to import 6 GiB of data using ImportAsync like this:

    var bulkUploader = new NpgsqlBulkUploader(_context);
    await bulkUploader.ImportAsync(messages);

Each message in messages is 10 KiB. When I run that code with 6 GiB of messages I quickly receive an OutOfMemoryException.

I analyzed the memory usage of the program and noticed that every message is being stored in memory. The only references to each message are from classes in EF Core. Specifically, these two classes are keeping references to every message:

  • Microsoft.EntityFrameworkCore.ChangeTracking.Internal.InternalClrEntityEntry
  • Microsoft.EntityFrameworkCore.ChangeTracking.Internal.EntityReferenceMap

I put a breakpoint in EF Core's StateManager to confirm entities were being added by Npgsql.Bulk. I put the breakpoint on this line of code: https://github.com/dotnet/efcore/blob/b970bf29a46521f40862a01db9e276e6448d3cb0/src/EFCore/ChangeTracking/Internal/StateManager.cs#L337

The breakpoint showed this call stack:

Microsoft.EntityFrameworkCore.dll!Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.UpdateReferenceMaps(Microsoft.EntityFrameworkCore.ChangeTracking.Internal.InternalEntityEntry entry, Microsoft.EntityFrameworkCore.EntityState state, Microsoft.EntityFrameworkCore.EntityState? oldState) Line 315
at /_/src/EFCore/ChangeTracking/Internal/StateManager.cs(315)

Microsoft.EntityFrameworkCore.dll!Microsoft.EntityFrameworkCore.ChangeTracking.Internal.StateManager.GetOrCreateEntry(object entity) Line 231
at /_/src/EFCore/ChangeTracking/Internal/StateManager.cs(231)

Npgsql.Bulk.dll!Npgsql.Bulk.ValueHelper<B.Message>.Get<System.DateTime, System.DateTime>(B.Message model, string propName, Microsoft.EntityFrameworkCore.DbContext context, System.DateTime localValue)

Message_72305025_8cb2_4d6d_92ed_f736f64690d5_1!Message_72305025_8cb2_4d6d_92ed_f736f64690d5_1.WriterForInsertAction(B.Message value, Npgsql.NpgsqlBinaryImporter value, Microsoft.EntityFrameworkCore.DbContext value)

Npgsql.Bulk.dll!Npgsql.Bulk.NpgsqlBulkUploader.ImportAsync<B.Message>(System.Collections.Generic.IEnumerable<Bt.Outbox.OutboxMessage> entities)

That call stack led me to this line of code in Npgsql.Bulk:

    var entry = sm.GetOrCreateEntry(model);

If I'm reading that correctly, it means every object sent to ImportAsync will eventually be stored in the DbContext's state tracker as a detached entity. Those entities are not cleaned up until the DbContext is disposed.

I was surprised by that behavior because this comment in NpgsqlBulkUploader implies that ImportAsync can be used with large data sets:

/// Simplified version of Insert which works better for huge sets (not calling ToList internally).

Is there any way to avoid storing every object in the DbContext's state tracker? If not, is there some way to clear out the state tracker periodically while ImportAsync is importing rows?

As a workaround I can import rows in batches, but that creates other problems due to variations in the size of the data I am importing. Batching by message count isn't enough because some message streams have small messages while others have very large ones. I'd prefer that Npgsql.Bulk simply not retain every message in memory so that I don't have to implement batching based on observed memory usage.
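For reference, one way to approximate the memory-bounded batching described above (a sketch, not part of Npgsql.Bulk; `BatchBySize` and the caller-supplied `sizeOf` estimator are hypothetical names) is to cut batches on an accumulated byte budget rather than on a fixed record count:

```csharp
using System;
using System.Collections.Generic;

internal static class BatchingHelper
{
    // Splits a stream of items into batches whose estimated total size
    // stays under maxBatchBytes. sizeOf is a caller-supplied estimator,
    // e.g. the serialized payload length of each message. A single item
    // larger than the budget still gets its own batch.
    public static IEnumerable<List<T>> BatchBySize<T>(
        IEnumerable<T> source, Func<T, long> sizeOf, long maxBatchBytes)
    {
        var batch = new List<T>();
        long batchBytes = 0;
        foreach (var item in source)
        {
            long size = sizeOf(item);
            if (batch.Count > 0 && batchBytes + size > maxBatchBytes)
            {
                yield return batch;
                batch = new List<T>();
                batchBytes = 0;
            }
            batch.Add(item);
            batchBytes += size;
        }
        if (batch.Count > 0)
            yield return batch;
    }
}
```

Each batch could then be passed to ImportAsync, ideally with a fresh DbContext per batch so tracked references are released in between. Even so, this relies on a size estimate and extra plumbing, which is why not retaining the entities in the first place is preferable.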

I found a workaround. I modified the IEnumerable passed to ImportAsync() to delete objects from EF Core's StateManager after each object is processed. There is no public method for deleting detached objects from the StateManager, so I had to use reflection to access the private method called UpdateReferenceMaps(). I wrapped calls to UpdateReferenceMaps() in an extension method. Here is the extension method:

    internal static class DbContextExtensions
    {
        private static MethodInfo _updateReferenceMaps;

        static DbContextExtensions()
        {
            _updateReferenceMaps = typeof(StateManager).GetMethod("UpdateReferenceMaps",
                BindingFlags.NonPublic
                | BindingFlags.Instance
                | BindingFlags.InvokeMethod
            );
            if (_updateReferenceMaps == null)
            {
                throw new InvalidOperationException("failed to locate UpdateReferenceMaps via reflection");
            }
        }

        // The goal of this method is to call this line of code in EF Core:
        // https://github.com/dotnet/efcore/blob/a29f30cc5d334cd2c375cc8ba1092c95fd51bf06/src/EFCore/ChangeTracking/Internal/EntityReferenceMap.cs#L377
        internal static void RemoveDetachedEntry(this DbContext db, object entry)
        {
#pragma warning disable EF1001 // Internal EF Core API usage.
            var sm = (StateManager)((IDbContextDependencies)db).StateManager;

            var internalEntry = sm.TryGetEntry(entry);
            if (internalEntry != null)
            {
                _updateReferenceMaps.Invoke(sm, new object[] { internalEntry, EntityState.Detached, EntityState.Detached });
            }
#pragma warning restore EF1001 // Internal EF Core API usage.
        }
    }

Then I modified my usage of NpgsqlBulkUploader as follows:

    Message lastMessage = null;
    messages = messages.Select(m =>
    {
        if (lastMessage != null)
        {
            _dbContext.RemoveDetachedEntry(lastMessage); // calls extension method defined above
        }
        lastMessage = m;
        return m;
    });

    var bulkUploader = new NpgsqlBulkUploader(_dbContext);
    await bulkUploader.ImportAsync(messages);

I no longer get an OutOfMemoryException and memory usage is relatively constant for the entire import operation.
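The Select-with-closure pattern above can also be factored into a small reusable iterator (a sketch; `DetachPrevious` is a hypothetical name) that invokes a callback on the item yielded just before the current one, so each fully processed entity is released while the next is being imported:

```csharp
using System;
using System.Collections.Generic;

internal static class EnumerableExtensions
{
    // Yields each item of the source sequence, invoking onPrevious on the
    // item yielded just before it. The final item is never passed to
    // onPrevious, since the consumer may still be using it. The iterator
    // is lazy, so callbacks fire as the consumer enumerates.
    public static IEnumerable<T> DetachPrevious<T>(
        this IEnumerable<T> source, Action<T> onPrevious) where T : class
    {
        T previous = null;
        foreach (var item in source)
        {
            if (previous != null)
                onPrevious(previous);
            previous = item;
            yield return item;
        }
    }
}
```

With the extension method from the earlier comment, usage would then look like `await bulkUploader.ImportAsync(messages.DetachPrevious(m => _dbContext.RemoveDetachedEntry(m)));`.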

I'm not happy about having to call internal EF Core APIs, but I see that Npgsql.Bulk is already doing that. I'm also not happy about having to use reflection to call a private method but I didn't see any other way to remove detached objects from EF Core's StateManager.

Hi @creste, yes, originally there was no GetOrCreateEntry call. It seems that while changing something else I didn't notice that it is now being called. One question: if you insert, say, 10k records and then change their state to Detached, will the memory be freed?

Hi @neisbut , thank you for replying.

I'm unable to find a way to set the records' state to "Detached" using public APIs because the DbContext doesn't think it has any entries, even though it is tracking all inserted records in its detached reference map. Perhaps you can provide some guidance on how to mark the records as detached, since I might be misunderstanding the question.

Here is how I tried to set the record's state to detached after every 10,000 records are inserted:

    await uploader.ImportAsync(items.Select((item, index) =>
    {
        if (index % 10_000 == 0)
        {
            foreach (var entry in db.ChangeTracker.Entries())
            {
                entry.State = Microsoft.EntityFrameworkCore.EntityState.Detached;
            }
        }
        return item;
    }));

When I run that code in a debugger, the foreach() loop never iterates over any entries because Entries() always returns 0 records, even when I've already imported 30,000 records. Yet, at the same time I can see the DbContext is tracking detached references to entities. Here is a screenshot showing where the memory leak occurs:

[screenshot: debugger watch window showing ChangeTracker.Entries() returning 0 entries while the StateManager's _detachedReferenceMap holds 30,000 references]

Notice how Entries shows there are no entries while _detachedReferenceMap is storing 30,000 references to the objects that were already imported. I'm guessing the call to GetOrCreateEntry() in ValueHelper is bypassing whatever mechanism is used by DbContext to know which entities are being tracked. That means I can't use the public methods on DbContext to get the list of entities. Instead, I have to use internal APIs that work with the StateManager directly. That's what the RemoveDetachedEntry() method in my prior comment does.

After experimenting further, I found another workaround that doesn't require reflection. The RemoveDetachedEntry() extension method I defined in my prior comment can be replaced with this implementation to remove the use of reflection:

        internal static void RemoveDetachedEntry(this DbContext db, object entry)
        {
#pragma warning disable EF1001 // Internal EF Core API usage.
            var sm = (StateManager)((IDbContextDependencies)db).StateManager;

            var internalEntry = sm.TryGetEntry(entry);
            if (internalEntry != null)
            {
                sm.StateChanging(internalEntry, EntityState.Detached);
            }
#pragma warning restore EF1001 // Internal EF Core API usage.
        }

That implementation also appears to fix the memory leak as long as RemoveDetachedEntry() is called once for each imported record.

Hi @creste, I made some changes to the code. It is now assumed that objects passed to the Import(Async) methods are fully initialized and ready, which means entities won't be attached to the StateManager at all. This should help decrease memory usage, and if it works you won't need any additional manipulations. The code is committed to GitHub now. Ideally it would be great if you could try this on your side before I upload it to NuGet. Is it possible for you to download and compile Npgsql.Bulk locally? Or would it be better to upload it to NuGet under a beta version?

Hi @neisbut, I removed my workaround and then downloaded, compiled, and tested the changes you made in ec3817d with my project. Those changes fixed the memory leak for me. Thank you!

Hi @creste , happy to hear that! I released this change in 0.8.4 version!

Hi @neisbut, I upgraded to 0.8.4 and confirmed the memory leak is still fixed. Thanks again!