Okay, let’s craft a general, uniform guide for building a dataset class for large image models, focusing on PyTorch and PyTorch Lightning. This structure is highly adaptable.

Core Principles for Your Dataset Class:

  1. Uniformity: The interface (__init__, __len__, __getitem__) should be consistent.
  2. Flexibility: Easily accommodate different data sources, label types, and transformations.
  3. Efficiency: Load data on-the-fly, leverage multi-processing in DataLoader, and handle large datasets without excessive memory usage.
  4. Clarity: Code should be well-commented and easy to understand.
  5. Reproducibility: Ensure that given the same settings, the dataset behaves identically (especially important for train/val/test splits).

We’ll structure this around:

  • PyTorch Dataset: The fundamental building block.
  • PyTorch Lightning LightningDataModule: The recommended way to organize data loading, splitting, and DataLoader instantiation for training, validation, and testing in a clean, reproducible, and shareable manner.

Guide for Building an Image Processing Dataset Class

Step 1: The Core PyTorch Dataset (torch.utils.data.Dataset)

This class is responsible for loading a single sample (image and its corresponding label/target) and applying initial transformations.

import os
import torch
from torch.utils.data import Dataset
from torchvision.transforms import ToTensor  # used in the error fallback below
from PIL import Image  # Or OpenCV: import cv2
import numpy as np
import pandas as pd  # Optional: if using CSV/TSV manifests

class CustomImageDataset(Dataset):
    def __init__(self, image_paths, labels, transform=None, image_load_mode='RGB'):
        """
        Args:
            image_paths (list or pd.Series): List of full paths to images.
            labels (list or pd.Series): List of corresponding labels.
                                       Can be integers for classification,
                                       paths to masks for segmentation,
                                       bounding boxes for detection, etc.
            transform (callable, optional): Optional transform to be applied
                                            on a sample.
            image_load_mode (str): 'RGB', 'L' (grayscale), etc. for PIL.
                                   For OpenCV, you'd handle color conversion manually.
        """
        self.image_paths = image_paths
        self.labels = labels
        self.transform = transform
        self.image_load_mode = image_load_mode

        # Sanity checks
        if len(self.image_paths) != len(self.labels):
            raise ValueError("image_paths and labels must have the same length.")
        if len(self.image_paths) == 0:
            print("Warning: Initializing an empty dataset.")

    def __len__(self):
        """Returns the total number of samples in the dataset."""
        return len(self.image_paths)

    def __getitem__(self, idx):
        """
        Fetches the sample (image and label) at the given index.

        Args:
            idx (int): Index of the sample to fetch.

        Returns:
            tuple: (image, label) where image is the transformed image
                   and label is the processed label.
        """
        if torch.is_tensor(idx):
            idx = idx.tolist()

        img_path = self.image_paths[idx]
        try:
            # --- Image Loading ---
            # Using Pillow (PIL)
            image = Image.open(img_path).convert(self.image_load_mode)
            # If using OpenCV:
            # image = cv2.imread(img_path)
            # image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)  # if model expects RGB
            # image = Image.fromarray(image)  # convert to PIL if transforms expect it
        except Exception as e:
            print(f"Error loading image {img_path}: {e}")
            # Returning a placeholder keeps training alive, but it is usually
            # better to filter out corrupted files beforehand.
            if self.transform:  # transforms expect a PIL image, so provide one
                dummy = {'image': Image.new(self.image_load_mode, (224, 224), color='gray'),
                         'label': torch.tensor(-1)}  # or an appropriate dummy label
                dummy = self.transform(dummy)
                return dummy['image'], dummy['label']
            return torch.zeros((3, 224, 224)), torch.tensor(-1)  # basic tensor dummy

        # --- Label Loading & Processing ---
        # This part is highly dependent on your task:
        raw_label = self.labels[idx]
        label = self._process_label(raw_label)

        # --- Apply Transformations ---
        # This example assumes the transform takes and returns a dict
        # {'image': ..., 'label': ...} -- see AlbumentationsTransform in Step 3,
        # or the torchvision adapter sketched after it.
        if self.transform:
            try:
                transformed = self.transform({'image': image, 'label': label})
                image, label = transformed['image'], transformed['label']
            except Exception as e:
                print(f"Error applying transform to {img_path}: {e}")
                # Fall back to a bare tensor conversion so downstream code
                # still receives a tensor.
                if not isinstance(image, torch.Tensor):
                    image = ToTensor()(image)

        return image, label

    def _process_label(self, raw_label):
        """
        Helper to process raw labels into the desired format.
        Customize this based on the task.
        """
        # Classification: labels are integers (or strings mapped to integers)
        if isinstance(raw_label, (int, np.integer)):
            return torch.tensor(raw_label, dtype=torch.long)
        # Segmentation: label is a path to a mask image
        elif isinstance(raw_label, str) and os.path.isfile(raw_label):
            mask = Image.open(raw_label).convert('L')  # grayscale mask
            return mask  # or np.array(mask) if your transforms expect arrays
        # Detection / keypoints: label is e.g. [x, y, w, h, class_id]
        elif isinstance(raw_label, (list, np.ndarray)):
            return torch.tensor(raw_label, dtype=torch.float32)
        # Add more cases as needed (text for captioning, etc.)
        else:
            return raw_label  # or raise an error / return a default tensor
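
One DataLoader detail to keep in mind: default collation stacks per-sample labels into a single tensor, which fails when labels vary in length per image (e.g., detection boxes). A minimal custom collate_fn sketch for that case (detection_collate_fn is an illustrative name; pass it to DataLoader via collate_fn=detection_collate_fn):

def detection_collate_fn(batch):
    """Stack images into one tensor; keep variable-length labels as a list."""
    images = torch.stack([image for image, _ in batch])
    labels = [label for _, label in batch]  # lengths may differ per image
    return images, labels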

Step 2: Data Source Parsing (Inside __init__ or a helper)

How you get image_paths and labels depends on your dataset structure:

  • Directory Structure (e.g., for classification):

    dataset_root/
        class_A/
            img1.jpg
            img2.png
        class_B/
            img3.jpeg
    

    You’d parse this by walking directories. self.class_to_idx would be useful.

  • Manifest File (CSV, JSON, TSV):

    # manifest.csv
    image_path,label
    path/to/image1.jpg,class_A
    path/to/image2.jpg,class_B
    path/to/image3.jpg,path/to/mask3.png  # for segmentation, the label column holds a mask path

    You’d use pandas to read this.
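
    A minimal sketch of the pandas side (assuming the 'image_path' and 'label' column names from the snippet above; Step 4's _load_manifest_data wraps exactly this):

    import pandas as pd

    df = pd.read_csv("manifest.csv")
    image_paths = df["image_path"].tolist()
    labels = df["label"].tolist()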

Example for directory structure parsing (add to CustomImageDataset.__init__ or a static method):

# ... inside CustomImageDataset or as a helper ...
@staticmethod
def _load_data_from_dirs(data_root, class_to_idx=None):
    image_paths = []
    labels = []
    if class_to_idx is None:
        # Only consider subdirectories as classes; a stray file in data_root
        # would otherwise consume a class index.
        class_names = sorted(
            d for d in os.listdir(data_root)
            if os.path.isdir(os.path.join(data_root, d))
        )
        class_to_idx = {cls_name: i for i, cls_name in enumerate(class_names)}

    for class_name, class_idx in class_to_idx.items():
        class_dir = os.path.join(data_root, class_name)
        if not os.path.isdir(class_dir):
            continue
        for fname in sorted(os.listdir(class_dir)):  # sorted for reproducibility
            if fname.lower().endswith(('.png', '.jpg', '.jpeg', '.bmp', '.gif')):
                image_paths.append(os.path.join(class_dir, fname))
                labels.append(class_idx)  # integer label for classification
    return image_paths, labels, class_to_idx

# In __init__, if using a directory structure:
# self.image_paths, self.labels, self.class_to_idx = self._load_data_from_dirs(data_root)

Step 3: Transformations (Augmentation & Preprocessing)

Use libraries like torchvision.transforms or albumentations. Albumentations is generally preferred for complex tasks (segmentation, detection) as it can transform images and masks/bboxes simultaneously.

from torchvision import transforms as T
import albumentations as A
from albumentations.pytorch import ToTensorV2

# Example transform function/class (can be passed to the Dataset)

# For torchvision (simpler, often for classification)
def get_torchvision_transforms(image_size=(224, 224), is_train=True):
    common_transforms = [
        T.Resize(image_size),
    ]
    if is_train:
        augmentation_transforms = [
            T.RandomHorizontalFlip(),
            T.RandomRotation(15),
            T.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
        ]
    else:
        augmentation_transforms = []

    normalization_transforms = [
        T.ToTensor(),  # Converts PIL Image or numpy.ndarray to tensor and scales to [0,1]
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])  # ImageNet stats
    ]
    return T.Compose(common_transforms + augmentation_transforms + normalization_transforms)


# For Albumentations (more powerful, good for segmentation/detection)
# This version applies the transform to a dictionary {'image': ..., 'label': ...}
class AlbumentationsTransform:
    def __init__(self, image_size=(224, 224), is_train=True):
        if is_train:
            self.transform = A.Compose([
                A.Resize(height=image_size[0], width=image_size[1]),
                A.HorizontalFlip(p=0.5),
                A.Rotate(limit=15, p=0.3),
                A.RandomBrightnessContrast(p=0.3),
                A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
                ToTensorV2(),  # Converts image & mask to PyTorch tensors
            ])
        else:
            self.transform = A.Compose([
                A.Resize(height=image_size[0], width=image_size[1]),
                A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
                ToTensorV2(),
            ])

    def __call__(self, sample):
        image, label = sample['image'], sample['label']

        # Albumentations expects numpy arrays for image and mask
        image_np = np.array(image)

        # Handle different label types
        if isinstance(label, Image.Image):  # e.g., a segmentation mask
            mask_np = np.array(label)
            transformed = self.transform(image=image_np, mask=mask_np)
            return {'image': transformed['image'], 'label': transformed['mask']}
        elif isinstance(label, torch.Tensor) and label.ndim == 0:  # classification label
            transformed = self.transform(image=image_np)  # transform the image only
            return {'image': transformed['image'], 'label': label}  # keep label as is
        elif isinstance(label, (list, np.ndarray, torch.Tensor)):  # bboxes, keypoints
            # Boxes need a supported format plus bbox_params on the A.Compose,
            # e.g. transformed = self.transform(image=image_np, bboxes=label, category_ids=...)
            # (see the bbox sketch below). For simplicity, only the image is
            # transformed here -- note this leaves boxes misaligned with any
            # geometric augmentation, so it is a placeholder, not a solution.
            transformed = self.transform(image=image_np)
            return {'image': transformed['image'],
                    'label': label if isinstance(label, torch.Tensor) else torch.as_tensor(label)}
        else:  # fallback for labels that don't need transforming with the image
            transformed = self.transform(image=image_np)
            return {'image': transformed['image'], 'label': label}

# Usage in CustomImageDataset:
# self.transform = AlbumentationsTransform(is_train=True)
# __getitem__ then calls: sample = self.transform({'image': image, 'label': label})
# The torchvision pipeline above takes an image directly; to use it with the
# dict-based __getitem__, wrap it in an adapter (see the sketch after the
# key point below).
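
If you do need boxes transformed together with the image, Albumentations supports this via bbox_params. A sketch under the assumption that boxes are in pascal_voc format ([x_min, y_min, x_max, y_max]) with integer class ids:

bbox_train_transform = A.Compose(
    [
        A.Resize(height=224, width=224),
        A.HorizontalFlip(p=0.5),
        A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
        ToTensorV2(),
    ],
    # Tells Albumentations how to flip, rescale, and clip boxes alongside the image
    bbox_params=A.BboxParams(format='pascal_voc', label_fields=['category_ids']),
)

# transformed = bbox_train_transform(image=image_np, bboxes=boxes, category_ids=class_ids)
# transformed['bboxes'] and transformed['category_ids'] stay aligned with the image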

Key point: Your __getitem__ needs to know what format the transform callable expects (e.g., just an image, or a dict of image and mask). The AlbumentationsTransform class above shows how to handle dicts.
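
One way to reconcile the two conventions is a thin adapter that gives an image-only torchvision pipeline the same dict interface (TorchvisionDictTransform is a helper name introduced here; the DataModule in Step 4 uses it for the 'torchvision' option):

class TorchvisionDictTransform:
    """Adapts an image-only transform to the {'image': ..., 'label': ...} interface."""
    def __init__(self, image_transform):
        self.image_transform = image_transform

    def __call__(self, sample):
        # Transform the image only; pass the label through unchanged
        return {'image': self.image_transform(sample['image']), 'label': sample['label']}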

Step 4: The PyTorch Lightning LightningDataModule

This module encapsulates all data-related logic: downloading, cleaning, splitting, and creating DataLoaders.

import pytorch_lightning as pl
from torch.utils.data import DataLoader

class CustomImageDataModule(pl.LightningDataModule):
    def __init__(self,
                 data_root: str = None,        # Path to data (e.g. dir of class folders or manifest parent)
                 manifest_train: str = None,   # Path to train manifest CSV/JSON (optional)
                 manifest_val: str = None,     # Path to val manifest CSV/JSON (optional)
                 manifest_test: str = None,    # Path to test manifest CSV/JSON (optional)
                 train_val_test_split: tuple = (0.7, 0.15, 0.15),  # If splitting from one source
                 image_size: tuple = (224, 224),
                 batch_size: int = 32,
                 num_workers: int = 4,
                 pin_memory: bool = True,
                 persistent_workers: bool = True,  # For PyTorch 1.8+
                 image_load_mode: str = 'RGB',
                 transform_lib: str = 'albumentations'  # 'torchvision' or 'albumentations'
                ):
        super().__init__()
        self.save_hyperparameters()  # Saves all __init__ args to self.hparams

        self.data_root = data_root
        self.manifest_train = manifest_train
        self.manifest_val = manifest_val
        self.manifest_test = manifest_test
        self.train_val_test_split = train_val_test_split
        self.image_size = image_size
        self.batch_size = batch_size
        self.num_workers = num_workers if num_workers is not None else os.cpu_count()
        self.pin_memory = pin_memory
        self.persistent_workers = persistent_workers if self.num_workers > 0 else False
        self.image_load_mode = image_load_mode
        self.transform_lib = transform_lib

        self.train_transforms = self._get_transforms(is_train=True)
        self.val_test_transforms = self._get_transforms(is_train=False)

        self.dataset_train = None
        self.dataset_val = None
        self.dataset_test = None
        self.class_to_idx = None  # Populated in setup()

    def _get_transforms(self, is_train=True):
        if self.transform_lib == 'albumentations':
            return AlbumentationsTransform(image_size=self.image_size, is_train=is_train)
        elif self.transform_lib == 'torchvision':
            # The torchvision pipeline takes an image directly; wrap it so it
            # matches the dict interface used by CustomImageDataset.__getitem__
            # (TorchvisionDictTransform is the adapter sketched in Step 3).
            return TorchvisionDictTransform(
                get_torchvision_transforms(image_size=self.image_size, is_train=is_train)
            )
        else:
            raise ValueError(f"Unsupported transform_lib: {self.transform_lib}")

    def _load_manifest_data(self, manifest_path):
        """Loads image paths and labels from a CSV manifest."""
        if not manifest_path:
            return [], []
        df = pd.read_csv(manifest_path)
        # Assuming columns 'image_path' and 'label'.
        # If image_path is relative, make it absolute:
        # df['image_path'] = df['image_path'].apply(lambda x: os.path.join(self.data_root, x) if self.data_root else x)
        return df['image_path'].tolist(), df['label'].tolist()

    def prepare_data(self):
        """
        Download, extract, or preprocess data here.
        Called once per node. Good for shared dataset setup, e.g.:
        - torchvision.datasets.MNIST(self.data_dir, train=True, download=True)
        - Checking that self.data_root and manifest files exist.
        """
        # If your data is already on disk, there may be little to do here.
        if self.data_root and not os.path.exists(self.data_root):
            raise FileNotFoundError(f"Data root {self.data_root} not found.")
        if self.manifest_train and not os.path.exists(self.manifest_train):
            raise FileNotFoundError(f"Train manifest {self.manifest_train} not found.")
        # etc. for val/test manifests

    def setup(self, stage: str = None):
        """
        Assign train/val/test datasets for the dataloaders.
        Called on every process in DDP.
        - Load data from disk/manifests
        - Create train/val/test splits
        - Apply transforms
        - Assign to self.dataset_train, self.dataset_val, self.dataset_test
        """
        if self.manifest_train and self.manifest_val:  # Pre-defined splits
            print("Loading data from pre-defined train/val manifests.")
            train_img_paths, train_labels = self._load_manifest_data(self.manifest_train)
            val_img_paths, val_labels = self._load_manifest_data(self.manifest_val)

            self.dataset_train = CustomImageDataset(
                image_paths=train_img_paths, labels=train_labels,
                transform=self.train_transforms, image_load_mode=self.image_load_mode
            )
            self.dataset_val = CustomImageDataset(
                image_paths=val_img_paths, labels=val_labels,
                transform=self.val_test_transforms, image_load_mode=self.image_load_mode
            )
            if self.manifest_test:
                test_img_paths, test_labels = self._load_manifest_data(self.manifest_test)
                self.dataset_test = CustomImageDataset(
                    image_paths=test_img_paths, labels=test_labels,
                    transform=self.val_test_transforms, image_load_mode=self.image_load_mode
                )

        elif self.data_root:  # Splitting from a single source directory
            print(f"Loading data from directory {self.data_root} and splitting.")
            # This example assumes a classification task using _load_data_from_dirs;
            # adapt it if your CustomImageDataset loads data differently.
            all_image_paths, all_labels, self.class_to_idx = \
                CustomImageDataset._load_data_from_dirs(self.data_root)

            if not all_image_paths:
                raise ValueError(f"No images found in {self.data_root}")

            total_len = len(all_image_paths)
            train_len = int(total_len * self.train_val_test_split[0])
            val_len = int(total_len * self.train_val_test_split[1])
            test_len = total_len - train_len - val_len  # remainder absorbs rounding

            print(f"Splitting: Train={train_len}, Val={val_len}, Test={test_len}")

            # Reproducible split: shuffle indices with a fixed seed, then build
            # one dataset per split so each split gets its own transform.
            # (Don't set .transform on a single dataset shared by random_split
            # Subsets -- the last assignment would win for all three splits.)
            perm = torch.randperm(total_len, generator=torch.Generator().manual_seed(42)).tolist()
            split_specs = {
                'train': (perm[:train_len], self.train_transforms),
                'val': (perm[train_len:train_len + val_len], self.val_test_transforms),
                'test': (perm[train_len + val_len:], self.val_test_transforms),
            }
            datasets = {}
            for name, (indices, tfm) in split_specs.items():
                datasets[name] = CustomImageDataset(
                    image_paths=[all_image_paths[i] for i in indices],
                    labels=[all_labels[i] for i in indices],
                    transform=tfm, image_load_mode=self.image_load_mode
                )
            self.dataset_train = datasets['train']
            self.dataset_val = datasets['val']
            self.dataset_test = datasets['test']

        else:
            raise ValueError("Must provide either data_root (for auto-split) or train/val manifests.")

        print(f"Train dataset size: {len(self.dataset_train) if self.dataset_train else 0}")
        print(f"Validation dataset size: {len(self.dataset_val) if self.dataset_val else 0}")
        print(f"Test dataset size: {len(self.dataset_test) if self.dataset_test else 0}")

    def train_dataloader(self):
        if not self.dataset_train:
            self.setup('fit')
        return DataLoader(
            self.dataset_train,
            batch_size=self.batch_size,
            shuffle=True,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
            persistent_workers=self.persistent_workers,
            drop_last=True  # Good for training stability with batch-dependent layers
        )

    def val_dataloader(self):
        if not self.dataset_val:
            self.setup('fit')
        return DataLoader(
            self.dataset_val,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
            persistent_workers=self.persistent_workers
        )

    def test_dataloader(self):
        if not self.dataset_test:
            self.setup('test')
        return DataLoader(
            self.dataset_test,
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
            persistent_workers=self.persistent_workers
        )

    # Optional: for prediction
    def predict_dataloader(self):
        # Usually the same as test_dataloader, or a dedicated prediction dataset
        if not self.dataset_test:
            self.setup('predict')  # or a self.dataset_predict
        return DataLoader(
            self.dataset_test,  # or self.dataset_predict
            batch_size=self.batch_size,
            shuffle=False,
            num_workers=self.num_workers,
            pin_memory=self.pin_memory,
            persistent_workers=self.persistent_workers
        )

Step 5: Using the DataModule in Training

# Example:
# 1. Prepare dummy data for classification
# DATA_ROOT/
#   cat/
#     cat1.jpg, cat2.jpg
#   dog/
#     dog1.jpg, dog2.jpg, dog3.jpg

# Create dummy images (run this once)
# import os
# from PIL import Image
# DATA_ROOT = "dummy_data_root"
# os.makedirs(os.path.join(DATA_ROOT, "cat"), exist_ok=True)
# os.makedirs(os.path.join(DATA_ROOT, "dog"), exist_ok=True)
# Image.new('RGB', (100, 100), color='red').save(os.path.join(DATA_ROOT, "cat/cat1.jpg"))
# Image.new('RGB', (100, 100), color='green').save(os.path.join(DATA_ROOT, "cat/cat2.jpg"))
# Image.new('RGB', (100, 100), color='blue').save(os.path.join(DATA_ROOT, "dog/dog1.jpg"))
# Image.new('RGB', (100, 100), color='yellow').save(os.path.join(DATA_ROOT, "dog/dog2.jpg"))
# Image.new('RGB', (100, 100), color='purple').save(os.path.join(DATA_ROOT, "dog/dog3.jpg"))

# --- Main training script ---
if __name__ == '__main__':
    # Configure the DataModule
    data_module = CustomImageDataModule(
        data_root="dummy_data_root",  # Use the dummy data path
        train_val_test_split=(0.6, 0.2, 0.2),  # For 5 images: 3 train, 1 val, 1 test
        image_size=(64, 64),
        batch_size=2,
        num_workers=0,  # 0 for easier debugging; >0 for performance
        transform_lib='torchvision'  # or 'albumentations'
    )

    # (Optional) Inspect:
    data_module.prepare_data()  # Checks paths
    data_module.setup()         # Creates datasets and splits

    print(f"Class to index mapping: {data_module.class_to_idx}")

    # Get a sample batch from the train dataloader
    train_loader = data_module.train_dataloader()
    for i, batch in enumerate(train_loader):
        images, labels = batch
        print(f"\nBatch {i+1}:")
        print("Images shape:", images.shape)  # e.g., torch.Size([2, 3, 64, 64])
        print("Labels:", labels)              # e.g., tensor([0, 1])
        if i == 0:  # Just show the first batch
            # To visualize (if you have matplotlib):
            # import matplotlib.pyplot as plt
            # img_to_show = images[0].permute(1, 2, 0).numpy()  # C,H,W -> H,W,C
            # # Denormalize (assuming the ImageNet stats from the example)
            # mean = np.array([0.485, 0.456, 0.406])
            # std = np.array([0.229, 0.224, 0.225])
            # img_to_show = np.clip(std * img_to_show + mean, 0, 1)
            # plt.imshow(img_to_show)
            # plt.title(f"Label: {labels[0].item()}")
            # plt.show()
            break

    # --- Integrate with the PyTorch Lightning Trainer ---
    # model = YourLightningModel(...)
    # trainer = pl.Trainer(max_epochs=10, accelerator='gpu', devices=1)
    # trainer.fit(model, datamodule=data_module)
    # trainer.test(datamodule=data_module)  # if you have a test set and model.test_step
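
For completeness, here is a minimal classification LightningModule you could drop in for YourLightningModel above. The tiny CNN and the names here are illustrative placeholders, not part of the guide's API:

import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl

class SimpleClassifier(pl.LightningModule):
    def __init__(self, num_classes=2, lr=1e-3):
        super().__init__()
        self.save_hyperparameters()
        # A deliberately small CNN just to exercise the data pipeline
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            nn.Flatten(), nn.Linear(32, num_classes),
        )

    def forward(self, x):
        return self.net(x)

    def training_step(self, batch, batch_idx):
        images, labels = batch
        loss = F.cross_entropy(self(images), labels)
        self.log('train_loss', loss)
        return loss

    def validation_step(self, batch, batch_idx):
        images, labels = batch
        self.log('val_loss', F.cross_entropy(self(images), labels))

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters(), lr=self.hparams.lr)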

Key Considerations & Customizations:

  • Label Types:
    • Classification: Labels are usually integers. self.class_to_idx is vital.
    • Segmentation: Labels are mask images (e.g., HxW or HxWxNumClasses). Ensure transforms correctly handle masks.
    • Object Detection: Labels are lists/tensors of bounding boxes ([x_min, y_min, x_max, y_max, class_id]) per image. Transforms need to adjust bboxes.
    • Image Captioning: Labels are sequences of text tokens.
    • Self-Supervised/Generative: May only need images, or image pairs.
  • Caching: For very slow __getitem__ (e.g., heavy online processing), consider caching processed samples to disk or memory (e.g., using joblib.Memory or LMDB). This adds complexity; a minimal in-memory sketch follows this list.
  • Large Datasets & Disk I/O:
    • Use efficient image formats (e.g., WebP can be smaller than JPG/PNG).
    • Consider num_workers > 0 in DataLoader for parallel data loading.
    • pin_memory=True can speed up CPU-to-GPU transfer.
    • If data is on network storage, local caching (e.g., copying to an SSD before training) might be beneficial.
  • Error Handling in __getitem__: Robustly handle corrupted images or missing files. You might choose to:
    • Skip the sample (requires careful indexing or filtering).
    • Return a placeholder sample (as shown in the example).
    • Log errors and filter out problematic files in a pre-processing step.
  • Hugging Face datasets Library: For many standard datasets, or if your data is easily convertible to Arrow tables, the datasets library can be a powerful alternative or complement. It offers efficient loading, mapping, and caching. You can even wrap a Hugging Face dataset within a PyTorch Dataset.
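
As referenced in the caching bullet above, a minimal in-memory sketch using functools.lru_cache (load_image_cached is a hypothetical helper; note that with num_workers > 0 each worker process keeps its own separate cache):

from functools import lru_cache
from PIL import Image

@lru_cache(maxsize=4096)  # size this to your RAM budget; decoded images are large
def load_image_cached(path: str, mode: str = 'RGB') -> Image.Image:
    """Decode an image once and serve repeated accesses from memory."""
    return Image.open(path).convert(mode)

# In CustomImageDataset.__getitem__, you could then swap in:
# image = load_image_cached(img_path, self.image_load_mode)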

This comprehensive guide provides a solid, uniform foundation. You’ll adapt the _process_label, data parsing logic, and transform choices based on your specific image processing task and dataset format. The LightningDataModule then provides the standardized framework for using it in your training pipeline.