Python RAM Leaks

Hello everyone,

I recently opened an issue about RAM leaks of dynamic types (https://github.com/opentap/opentap/issues/2318). That issue has been fixed and mixins no longer cause RAM leaks, which is great. Thank you for the quick fix. In further testing after the fix, however, we ran into other issues in the Python plugin. The symptoms are similar to issue 2318: reloading and running repeatedly in the same session leads to an enormous increase in RAM usage over time. Unfortunately, this one seems to be more complex, since it also involves the pythonnet fork. We investigated the sources of the leaks and were able to reduce them substantially, but not to zero. I am happy to provide more information on our findings and our temporary fixes if needed.

I have attached a reproduction program to verify the leak quickly, using this step:

```python
# Trivial Python-derived TestStep: any Python step triggers the leak.
# The point is the *kind* of object (Python-derived ClassDerived), not what it does.
import opentap
from opentap import *
import OpenTap
from OpenTap import Verdict

@attribute(OpenTap.Display("Leak Step", "Trivial python step used by mem_leak_repro.", "TapX"))
class LeakStep(TestStep):
    def __init__(self):
        super().__init__()
        self.Name = "Leak Step"

    def Run(self):
        self.UpgradeVerdict(Verdict.Pass)
```

Program.cs (7.5 KB)

Hi @mavo1,

thanks for raising this problem.

According to the pythonnet source, `CLRObject.reflectedObjects` is a `HashSet`, so it should not cause leaks. Can you let me know exactly what you did to fix the issue? We should probably get that integrated into pythonnet as soon as possible. We are aiming to update the pythonnet dependency for Python 3.14 support anyway.

Hey Rolf,

Thank you for the reply. I have written a short document describing what we think the issues are and how we tried to fix them, and I have attached it for you. Those changes required a few other minor fixes, but overall they are what brought our RAM usage down considerably, although they didn't solve the problem entirely. Glad to hear that you are planning to update the pythonnet dependency.

OPENTAP_PYTHON_RAM_REPORT.txt (7.9 KB)

Hi @mavo1, it seems something went wrong when uploading the text file; the link points to a 404 page.

Ah yes, sorry about that. I will just paste the readme file directly here.

# OpenTAP + pythonnet — Memory-Leak Fixes

Reference baselines: pythonnet `v3.0.1` (`42ee643`), `OpenTap.Python` 3.2.1 (`eb8f477`).

Empirically reproduced on Windows x64 (Python 3.11) and Linux ARM64 (Revolution Pi). Reproducer: 50 plans × 50 trivial Python `TestStep` instances per plan, `GC.GetTotalMemory(forceFullCollection: true)` between iterations.

| Build | Growth over 50 × 50 step instances |
|---|---|
| Stock pythonnet `v3.0.1` | **+39 MB** (~16 KB / step instance, monotonic) |
| With the fixes below | **+1 MB** (~400 B / step instance) |
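The per-instance figures in the table follow directly from the totals. A minimal sketch of that arithmetic, assuming "MB" means 10^6 bytes and that the table values are rounded (the function name is made up for illustration):

```python
# Sanity check of the per-instance figures quoted in the table.
# Assumption: "MB" is read as 10**6 bytes; the table values are rounded.
PLANS = 50
STEPS_PER_PLAN = 50
INSTANCES = PLANS * STEPS_PER_PLAN  # 2500 step instances in total


def bytes_per_instance(total_growth_mb: float) -> float:
    """Average growth per step instance, in bytes (hypothetical helper)."""
    return total_growth_mb * 1_000_000 / INSTANCES


stock = bytes_per_instance(39)  # stock pythonnet v3.0.1
fixed = bytes_per_instance(1)   # with the fixes applied

print(f"stock: {stock:.0f} B per instance")
print(f"fixed: {fixed:.0f} B per instance")
print(f"reduction: {stock / fixed:.0f}x")
```

So the fixes cut the measured growth by a factor of roughly 39, which matches the ~16 KB versus ~400 B per step instance quoted in the table.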

---

## 1. `CLRObject.reflectedObjects` is append-only

**Where:** `src/runtime/Types/ClrObject.cs`, the field `internal static readonly HashSet<IntPtr> reflectedObjects`.

**What goes wrong:** every CLR object exposed to Python is recorded in this hash set (via `CLRObject.GetReference` and `ToPython`). Entries are inserted but **never removed**. When the Python wrapper’s refcount hits zero, the `IntPtr` stays in the set, so the underlying `GCHandle` and the wrapper’s `__dict__` (which holds the Python state of any Python-derived `TestStep`) survive until process exit. The entries are reachable from a static field, so `GC.Collect(GC.MaxGeneration, …, blocking: true, compacting: true)` does **not** reclaim them.

This is the dominant leak source: disabling only this fix in our build reproduces the +39 MB stock-pythonnet number, and re-enabling it brings growth back down to +1 MB.
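The retention mechanism can be shown in isolation with a pure-Python analogy. This is only a sketch: `Wrapper`, `strong_registry`, `register` and `evict` are hypothetical names and none of it is pythonnet code. It shows that an object reachable from a module-level registry survives a full collection until something explicitly removes its entry.

```python
import gc
import weakref


class Wrapper:
    """Hypothetical stand-in for a Python-side wrapper of a CLR object."""

    def __init__(self, name: str) -> None:
        self.name = name


# Analogue of a static, append-only set: one strong reference per entry.
strong_registry: set = set()


def register(obj: Wrapper) -> None:
    strong_registry.add(obj)


def evict(is_abandoned) -> int:
    """Remove every entry the predicate reports as abandoned; return the count."""
    abandoned = [obj for obj in strong_registry if is_abandoned(obj)]
    for obj in abandoned:
        strong_registry.discard(obj)
    return len(abandoned)


w = Wrapper("step-1")
probe = weakref.ref(w)  # lets us observe whether the object is still alive
register(w)

del w
gc.collect()
alive_before = probe() is not None  # the registry still holds it

released = evict(lambda obj: obj.name.startswith("step-"))
gc.collect()
alive_after = probe() is not None   # nothing references it any more

print(alive_before, released, alive_after)
```

The object only becomes collectable after `evict` drops the registry entry, which is why a forced collection alone cannot help as long as the registry is append-only.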

**Fix (pythonnet):** add an explicit eviction API:

```csharp
// CLRObject.cs
internal static int EvictAbandonedObjects(Func<object, bool>? isAbandoned = null)
{
    lock (reflectedObjects)
    {
        var snapshot = reflectedObjects.ToArray();
        int released = 0;

        // Pass 1 - phantoms: rc<=1, GCHandle invalid, __dict__ already cleared.
        // Always safe: nothing on the Python side can resurrect these.
        var preEvictRc1 = new HashSet<IntPtr>();
        foreach (var addr in snapshot)
            if (Runtime.Refcount(addr) <= 1) preEvictRc1.Add(addr);

        foreach (var addr in snapshot)
            if (reflectedObjects.Contains(addr) && IsPhantom(addr)) {
                FreeGCHandle(addr);
                reflectedObjects.Remove(addr);
                released++;
            }

        // Pass 2 - caller-driven abandonment predicate. Skip entries that were
        // already at rc==1 *before* pass 1; those are still-live objects whose
        // owners are mid-call.
        if (isAbandoned != null) {
            foreach (var addr in snapshot) {
                if (preEvictRc1.Contains(addr)) continue;
                if (!reflectedObjects.Contains(addr)) continue;
                var managed = TryGetManaged(addr);
                if (managed != null && isAbandoned(managed)) {
                    FreeGCHandle(addr);
                    reflectedObjects.Remove(addr);
                    released++;
                }
            }
        }
        return released;
    }
}
```

Re-export from `Runtime.cs` so external assemblies can call it without referencing internals:

```csharp
public static int EvictAbandonedObjects(Func<object, bool>? isAbandoned = null)
    => CLRObject.EvictAbandonedObjects(isAbandoned);
```

---

## 2. No driver to call the eviction

**Where:** `OpenTap.Python` package — currently nothing calls into pythonnet’s eviction (even if it existed).

**Fix:** ship an `ITestPlanRunMonitor` that periodically drains the finalizer queue and calls `Runtime.EvictAbandonedObjects(IsAbandoned)`:

```csharp
public sealed class PythonMemoryMonitor : ComponentSettings,
                                          ITestPlanRunMonitor
{
    [Display("Cleanup Interval (s)")] public int CleanupIntervalSeconds { get; set; } = 30;

    private static readonly object _liveLock = new();
    private static readonly List<WeakReference> _live = new();

    private Thread _cleanupThread;
    private readonly ManualResetEventSlim _stop = new(false);

    public void EnterTestPlanRun(TestPlanRun run)
    {
        var plan = (TestPlan)typeof(TestPlanRun)
            .GetField("plan", BindingFlags.Instance | BindingFlags.NonPublic)!
            .GetValue(run)!;
        lock (_liveLock) _live.Add(new WeakReference(plan));

        _stop.Reset();
        _cleanupThread = new Thread(CleanupLoop) {
            IsBackground = true,        // ★ NOT a thread-pool job; see issue 3
            Name = "PythonMemoryMonitor"
        };
        _cleanupThread.Start();
    }

    public void ExitTestPlanRun(TestPlanRun run)
    {
        _stop.Set();
        _cleanupThread?.Join(TimeSpan.FromSeconds(5));
    }

    private void CleanupLoop()
    {
        while (!_stop.Wait(TimeSpan.FromSeconds(CleanupIntervalSeconds)))
        {
            if (!PythonEngine.IsInitialized) continue;
            using (Py.GIL()) Runtime.EvictAbandonedObjects(IsAbandoned);
        }
    }

    private static bool IsAbandoned(object inst)
    {
        if (inst is TestPlan tp) return !IsLive(tp);
        if (inst is ITestStepParent node) {
            for (int i = 0; i < 64 && node != null; i++) {  // bounded against cycles
                if (node is TestPlan p) return !IsLive(p);
                var next = node.Parent;
                if (ReferenceEquals(next, node)) break;
                node = next;
            }
            return true;            // free-floating step graph
        }
        return false;               // global resource (ComponentSettings, ...)
    }

    private static bool IsLive(TestPlan p) {
        lock (_liveLock)
            foreach (var w in _live)
                if (ReferenceEquals(w.Target, p)) return true;
        return false;
    }
}
```

---

## 3. ThreadPool growth from stuck GIL callbacks

**Where:** any periodic cleanup invocation that uses `ThreadPool.QueueUserWorkItem`, `Task.Run`, or `System.Threading.Timer` to call into pythonnet under the GIL.

**What goes wrong:** if the GIL is held by a long-running Python callback (or a `Result Listener` mid-publish), the queued work item blocks waiting for the GIL. The `ThreadPool` interprets the stuck thread as starvation and grows another worker — which also blocks. We measured **8 000+ leaked threads** on ARM64 inside a multi-hour soak run before identifying this as the cause. Each thread is ~8 MB of committed stack on Linux, so the secondary leak dwarfs the primary one over time.

**Fix:** use a dedicated `IsBackground = true` `Thread`, not the pool, for any periodic work that must acquire the GIL. The `PythonMemoryMonitor` sketch in issue #2 already does this.
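The dedicated-thread pattern can be sketched in pure Python as well. This is an analogy only, not part of the plugin, and `PeriodicWorker` is a hypothetical name. One long-lived daemon thread does the periodic work and sleeps on an event, so a job that blocks can stall at most that single thread and never causes more workers to be spawned.

```python
import threading


class PeriodicWorker:
    """Runs `job` every `interval` seconds on one dedicated daemon thread."""

    def __init__(self, job, interval: float) -> None:
        self._job = job
        self._interval = interval
        self._stop = threading.Event()
        self._thread = threading.Thread(
            target=self._loop, name="PeriodicWorker", daemon=True
        )

    def start(self) -> None:
        self._thread.start()

    def stop(self, timeout: float = 5.0) -> None:
        self._stop.set()
        self._thread.join(timeout)

    def is_running(self) -> bool:
        return self._thread.is_alive()

    def _loop(self) -> None:
        # Event.wait returns False on timeout and True once stop() was called.
        while not self._stop.wait(self._interval):
            self._job()  # may block; only this one thread can ever be stuck


calls = []
worker = PeriodicWorker(lambda: calls.append(threading.current_thread().name), 0.01)
worker.start()
threading.Event().wait(0.2)  # let a few iterations run
worker.stop()
```

Every invocation is recorded under the same thread name, and `stop()` mirrors the `_stop.Set()` plus bounded `Join` used in `ExitTestPlanRun`.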

---

## 4. `PyObject` lifetime not anchored to the .NET object

**Where:** `OpenTap.Python/PythonTypeDataWrapper.cs` — wherever the plugin caches or returns a `PyObject` derived from a managed instance.

**What goes wrong:** the .NET GC is free to collect a `PyObject` before pythonnet’s `reflectedObjects` slot for the same `IntPtr` is cleared. The next `GetReference(sameObj, sameType)` call returns the address of the dead wrapper. With the eviction sweep from issue #1 active this becomes visible: previously-permanent zombies are now actually freed, and any code path that re-uses a stale `PyObject` reads released memory.

**Fix:** anchor the `PyObject` lifetime to the .NET object via a `ConditionalWeakTable` so the entry disappears exactly when the .NET object becomes unreachable — never sooner, never later:

```csharp
// PythonTypeDataWrapper.cs
private static readonly ConditionalWeakTable<object, PyObject> _aliveObjects = new();

public PyObject ToPyObject(object netInstance)
{
    if (_aliveObjects.TryGetValue(netInstance, out var existing)) return existing;
    var fresh = /* CLRObject.GetReference(...) wrapped as PyObject */;
    _aliveObjects.Add(netInstance, fresh);
    return fresh;
}
```

`ConditionalWeakTable` keeps the value alive only as long as the *key* is alive, but never the other way round, so the table itself adds zero retention and prevents the dangling-pointer regression that the eviction pass would otherwise expose.
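For readers more familiar with Python, the closest standard-library analogue is `weakref.WeakKeyDictionary`. The sketch below uses hypothetical stand-in classes (`NetInstance`, `PyWrapper`) and is not plugin code. One caveat: unlike `ConditionalWeakTable`, a `WeakKeyDictionary` value that strongly references its own key keeps the key alive, so the wrapper here deliberately holds no reference back to it.

```python
import gc
import weakref


class NetInstance:
    """Hypothetical stand-in for a managed .NET object (the cache key)."""


class PyWrapper:
    """Hypothetical stand-in for the cached PyObject (the cache value)."""

    def __init__(self, label: str) -> None:
        self.label = label  # deliberately no strong reference to the key


_alive = weakref.WeakKeyDictionary()


def to_wrapper(instance: NetInstance) -> PyWrapper:
    existing = _alive.get(instance)
    if existing is not None:
        return existing
    fresh = PyWrapper(f"wrapper-{id(instance):x}")
    _alive[instance] = fresh
    return fresh


obj = NetInstance()
first = to_wrapper(obj)
second = to_wrapper(obj)
same_wrapper = first is second    # cache hit while the key is alive
size_while_alive = len(_alive)

del obj
gc.collect()
size_after_collect = len(_alive)  # the entry vanished together with the key

print(same_wrapper, size_while_alive, size_after_collect)
```

The cache returns the same wrapper for as long as the key object exists and drops the entry once the key is collected, which is the lifetime coupling the `ConditionalWeakTable` fix relies on.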