PyTorch Bug: Corrupted Tensors On Failed Resize
Encountering unexpected behavior with PyTorch tensor operations can be a real head-scratcher, especially when it leads to crashes or unpredictable results. A particularly thorny issue has been identified where PyTorch updates tensor shape metadata even when a storage resize operation fails, leaving behind corrupted tensors. This problem, which can result in segmentation faults or internal runtime errors, occurs when you attempt to resize a tensor whose storage is shared with a non-resizable buffer, such as a NumPy array's buffer attached via set_().
Let's dive deep into this peculiar bug, understand its mechanics, and explore how it impacts your workflow. We'll also provide a clear reproduction case and discuss the expected vs. actual behavior.
The Nitty-Gritty: How the Corruption Happens
The core of the problem lies in PyTorch's resize_() operation. When you call resize_() on a tensor that shares its underlying storage with a buffer that cannot be resized (like a NumPy array), PyTorch is designed to raise a RuntimeError with the message "Trying to resize storage that is not resizable." That is the correct and expected outcome: you are attempting an operation that is fundamentally incompatible with the shared memory.

The bug is that the operation is not exception-safe. PyTorch updates the tensor's shape and stride metadata to reflect the requested size before the resizable-storage check runs and fails. This leaves the tensor in an inconsistent "zombie" state: tensor.shape reports a seemingly valid, often larger, size, while the tensor's actual storage still holds zero bytes of data. Any subsequent attempt to access or manipulate the corrupted tensor, such as printing it or running an operation that needs valid storage, can fail catastrophically with a segmentation fault or an internal RuntimeError. This inconsistency between what the tensor claims to be (its shape) and what it actually holds (its empty storage) is the root cause of the instability.
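To make the faulty ordering concrete, here is a minimal sketch in plain Python. The real logic lives in PyTorch's C++ ATen internals; ToyStorage and ToyTensor are illustrative inventions, not PyTorch classes. The sketch simulates resize_() writing shape metadata before the storage check has a chance to fail:

import math

class ToyStorage:
    def __init__(self, nbytes, resizable):
        self.nbytes = nbytes
        self.resizable = resizable

class ToyTensor:
    def __init__(self):
        self.shape = (0,)
        self.storage = ToyStorage(nbytes=0, resizable=False)

    def resize_(self, new_shape, element_size=4):
        # BUG: shape metadata is written first ...
        self.shape = new_shape
        needed = element_size * math.prod(new_shape)
        # ... and only afterwards does the storage check fail.
        if needed > self.storage.nbytes and not self.storage.resizable:
            raise RuntimeError("Trying to resize storage that is not resizable")
        self.storage.nbytes = needed

t = ToyTensor()
try:
    t.resize_((5, 5, 5))
except RuntimeError:
    pass
print(t.shape)           # (5, 5, 5), metadata already mutated
print(t.storage.nbytes)  # 0, storage untouched

The obvious remedy is to reorder the steps: validate (or resize) the storage first, and touch the metadata only after that has succeeded.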
Visualizing the Corruption: A Minimal Reproduction Case
To truly grasp the severity and mechanics of this bug, it's best to see it in action. We can create a minimal reproduction case using PyTorch and NumPy. This example will clearly illustrate how a tensor can become corrupted.
import torch
import numpy as np
# Create non-resizable storage (0 bytes)
locked_storage = torch.from_numpy(np.array([], dtype=np.int32)).untyped_storage()
# Inject into a fresh tensor
t = torch.tensor([], dtype=torch.int32)
t.set_(locked_storage)
# Attempt to resize (Expected: Fail, maintain original shape)
# (Actual: Fails, but updates shape to 5x5x5)
try:
    t.resize_((5, 5, 5))
except RuntimeError:
    pass
# Verify corruption
print(f"Shape: {t.shape}") # Prints: torch.Size([5, 5, 5])
print(f"Storage: {t.untyped_storage().nbytes()}") # Prints: 0
print(t) # CRASH
In this code snippet, we first create a zero-byte untyped_storage using an empty NumPy array. We then create a new PyTorch tensor t and attach this locked_storage to it using t.set_(locked_storage). The critical step is attempting to resize this tensor to (5, 5, 5) within a try-except block. As expected, PyTorch correctly identifies that the storage cannot be resized and raises a RuntimeError. However, as the output demonstrates, the tensor's shape has already been updated to torch.Size([5, 5, 5]) before the exception was caught. The storage, on the other hand, remains at 0 bytes. The subsequent print(t) command, which attempts to access the tensor's data based on its reported shape, triggers the crash. In some environments, this might manifest as a RuntimeError during printing, while in others, it can lead to a more severe segmentation fault, depending on how the underlying C++ components handle the invalid state.
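Until the bug is fixed upstream, you can defend against zombie tensors with a consistency check of your own. The helper below, storage_is_consistent, is a hypothetical name and not part of the PyTorch API; it simply compares the bytes the reported shape implies against what the storage actually holds (a simplified check that assumes a contiguous tensor):

import torch

def storage_is_consistent(t: torch.Tensor) -> bool:
    # Bytes implied by the reported shape must fit in the actual storage.
    needed = t.numel() * t.element_size()
    return needed <= t.untyped_storage().nbytes()

# Safe to print only if metadata and storage agree:
if storage_is_consistent(t):
    print(t)
else:
    print("Tensor metadata and storage are out of sync; do not touch its data")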
Expected vs. Actual Behavior: A Tale of Two Outcomes
Understanding the discrepancy between what should happen and what is happening is crucial for debugging and reporting such issues. The desired, robust behavior for this scenario aligns with the Strong Exception Guarantee. This guarantee states that if an operation fails (throws an exception), the program should be left in the state it was in before the operation began. In our case, if resize_() fails because the storage is not resizable, the tensor's metadata (shape and stride) should remain unchanged. It should continue to reflect its original shape, which in our minimal example is torch.Size([0]). This ensures that even if the resize operation fails, the tensor remains in a consistent and usable state, albeit with its original dimensions.
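In code, the strong exception guarantee would look like this: rerunning the reproduction case, both assertions below should pass on a fixed build (on affected versions, the shape assertion fails):

import torch
import numpy as np

t = torch.tensor([], dtype=torch.int32)
t.set_(torch.from_numpy(np.array([], dtype=np.int32)).untyped_storage())

try:
    t.resize_((5, 5, 5))
except RuntimeError:
    pass

# Strong exception guarantee: the failed resize leaves no trace.
assert t.shape == torch.Size([0])         # fails on affected builds
assert t.untyped_storage().nbytes() == 0  # holds either way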
The actual behavior, as demonstrated by the reproduction case, violates this guarantee. Although the RuntimeError is correctly raised and caught, the tensor's metadata is left in an inconsistent state. The shape is misleadingly updated to torch.Size([5, 5, 5]), while the storage remains empty (0 bytes). This mismatch creates a tensor that lies about its own contents: any downstream code that trusts the reported shape will attempt to read memory that was never allocated, which is exactly what triggers the crashes described above.