Last modified: December 10, 2022
Performance Optimization and Parallelism
When working with complicated datasets and sophisticated visualization pipelines, performance optimization and parallelism become important for delivering real-time or near-real-time insights. VTK (Visualization Toolkit) supports a variety of performance-enhancing techniques and offers a strong framework for parallel processing, allowing you to scale your visualization workflows to handle massive datasets or highly detailed 3D scenes. This section covers several key strategies to help optimize VTK-based applications:
- Level of Detail (LOD)
- Culling
- Parallel Rendering and Processing
With these techniques, you can keep your visualization pipeline responsive and efficient, even in demanding scenarios such as medical imaging, large-scale simulations, or interactive 3D modeling.
Level of Detail (LOD)
Level of Detail (LOD) is a common technique in computer graphics aimed at reducing the rendering load by simplifying objects based on their importance or visual impact. In large scenes or interactive applications, rendering the highest-quality version of every single object can become extremely expensive. LOD solves this by dynamically selecting an appropriate representation depending on factors such as:
- The distance from the camera determines the level of detail, with objects farther from the viewer rendered at a lower resolution to reduce computational load while maintaining acceptable visual quality.
- System performance or frame rate can influence LOD adjustments, where a drop in frame rate triggers a switch to simpler models, ensuring the application remains responsive and maintains interactivity.
LOD strategies help maintain smooth rendering and interactive frame rates even when dealing with very large or complicated 3D environments.
Classes Associated with LOD
VTK provides specialized classes that implement LOD functionality out of the box.
I. vtkLODActor
- Automatic LOD adjustment dynamically modifies the level of detail for objects based on their distance from the camera or the current system performance constraints.
- The system can manage multiple polygonal representations or mappers of the same geometry, with VTK intelligently switching between them depending on rendering needs.
- Smooth transitions between levels of detail minimize visual artifacts like “popping,” ensuring a seamless viewing experience during LOD changes.
- This approach helps balance visual quality and performance, especially in scenarios with complex scenes or when maintaining frame rates is critical.
- Automatic LOD adjustment is particularly useful in applications like virtual reality, real-time simulations, and large-scale visualizations, where maintaining interactivity and responsiveness is key.
II. vtkLODProp3D
- A generalized LOD (Level of Detail) interface offers a generic mechanism to manage multiple levels of detail for a single 3D object, adapting its complexity based on rendering needs.
- This interface provides customization options, making it suitable for users requiring precise control over the creation, selection, and switching of LODs.
- Flexible strategies enable developers to implement custom rendering approaches or define their own criteria for transitioning between different levels of detail.
- The generalized interface supports scenarios where predefined LOD mechanisms may not be sufficient, accommodating unique requirements such as dynamic detail adjustments or procedural LOD generation.
- By using this interface, rendering pipelines can optimize performance while maintaining visual fidelity in diverse applications like gaming, simulation, and large-scale visualization.
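To make this concrete, below is a minimal sketch of how a vtkLODProp3D might be populated with two explicit levels of detail. The sphere resolutions and the render-time estimates passed to AddLOD are illustrative assumptions; passing 0.0 lets VTK measure the actual render time of each LOD.
import vtk
# Create a vtkLODProp3D, which holds several user-defined LODs for one object
lod_prop = vtk.vtkLODProp3D()
# High-resolution representation
high_res = vtk.vtkSphereSource()
high_res.SetThetaResolution(60)
high_res.SetPhiResolution(60)
high_mapper = vtk.vtkPolyDataMapper()
high_mapper.SetInputConnection(high_res.GetOutputPort())
# Low-resolution representation
low_res = vtk.vtkSphereSource()
low_res.SetThetaResolution(8)
low_res.SetPhiResolution(8)
low_mapper = vtk.vtkPolyDataMapper()
low_mapper.SetInputConnection(low_res.GetOutputPort())
# Register each LOD; the second argument is an estimated render time in seconds
# (0.0 means "unknown", letting VTK measure it as it renders)
high_id = lod_prop.AddLOD(high_mapper, 0.0)
low_id = lod_prop.AddLOD(low_mapper, 0.0)
# Let VTK pick whichever LOD fits the render time allocated to the prop
lod_prop.SetAutomaticLODSelection(1)
# The prop is added to a renderer like any other vtkProp3D
renderer = vtk.vtkRenderer()
renderer.AddViewProp(lod_prop)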
Example of Creating a vtkLODActor
Below is a simple example that demonstrates how to create and configure a vtkLODActor to handle different levels of detail:
import vtk
# Create an instance of vtkLODActor
lod_actor = vtk.vtkLODActor()
# Create a mapper for the LOD actor
mapper = vtk.vtkPolyDataMapper()
# For demonstration, configure a sphere source as the mapper’s input
sphere_source = vtk.vtkSphereSource()
sphere_source.SetThetaResolution(50)
sphere_source.SetPhiResolution(50)
mapper.SetInputConnection(sphere_source.GetOutputPort())
# Set the mapper for the LOD actor's default LOD (index 0)
lod_actor.SetMapper(mapper)
# Optionally, add other levels of detail using additional mappers.
# For instance, a lower-detail sphere:
low_res_mapper = vtk.vtkPolyDataMapper()
low_res_sphere = vtk.vtkSphereSource()
low_res_sphere.SetThetaResolution(10)
low_res_sphere.SetPhiResolution(10)
low_res_mapper.SetInputConnection(low_res_sphere.GetOutputPort())
# Add the lower-resolution mapper as an LOD
lod_actor.AddLODMapper(low_res_mapper)
# Optionally control how many points the automatically generated
# point-cloud LOD uses
lod_actor.SetNumberOfCloudPoints(1000)
- We create a vtkLODActor, which automatically manages multiple LOD mappers.
- We define a high-resolution sphere (sphere_source) and a low-resolution sphere (low_res_sphere), each tied to its own vtkPolyDataMapper.
- The first mapper is set as the default high-resolution LOD, and the second low_res_mapper is added as an additional LOD via AddLODMapper.
- SetNumberOfCloudPoints controls how many points are used by the point-cloud LOD that vtkLODActor generates automatically.
- When rendered, vtkLODActor decides whether to use the high- or low-resolution version based on the render time allocated to the actor, which reflects camera distance and rendering performance goals.
Culling
Culling is another powerful method for optimizing rendering performance by removing objects or parts of objects that do not contribute to the final image. Common types of culling include:
- Frustum culling is a technique that excludes objects outside the camera's viewing frustum, a 3D region shaped like a pyramid or truncated pyramid defined by the camera's position, orientation, and field of view. Objects lying completely outside this region are skipped during rendering, saving computational resources.
- This method works by testing the bounding volume of each object against the boundaries of the frustum. If the volume lies entirely outside the frustum, the object is culled. This test is performed early in the rendering pipeline, ensuring non-visible objects are excluded before further processing.
- Occlusion culling complements frustum culling by removing objects that are entirely obscured by other objects from the camera’s perspective. This involves checking if an object is behind other geometry relative to the camera view and not visible in the final frame.
- Occlusion culling often relies on depth-buffer information or specialized algorithms like hierarchical z-buffering to efficiently detect hidden objects. Unlike frustum culling, which excludes objects based on their position, occlusion culling focuses on visibility relative to other objects in the scene.
These techniques save on both geometry processing and rasterization time since fewer objects must be transformed, shaded, and drawn.
Key Classes for Culling
I. vtkFrustumCoverageCuller
- Frustum-coverage culling estimates how much of the camera’s view frustum each object covers; objects with zero coverage (entirely outside the frustum) are culled.
- Objects outside the frustum are automatically excluded early, avoiding costly rendering operations.
- This approach is broadly efficient, especially in large-scale scenes like CAD models or complex 3D environments, where many objects fall outside the visible viewport.
II. Occlusion-Based Culling
- Occlusion culling determines whether an object is hidden behind another object from the camera’s perspective, avoiding unnecessary rendering.
- Depth-based techniques (such as hierarchical z-buffering or hardware occlusion queries) detect and exclude occluded geometry from the rendering pipeline.
- VTK does not ship a dedicated occlusion culler; this behavior is typically implemented with a custom vtkCuller subclass, and the resource savings are greatest in densely populated scenes where many objects overlap or block each other.
Example of Using vtkFrustumCoverageCuller
Below is an example showcasing how to integrate vtkFrustumCoverageCuller into a simple VTK pipeline:
import vtk
# Create a renderer
renderer = vtk.vtkRenderer()
# Create an instance of vtkFrustumCoverageCuller (VTK's built-in frustum/coverage culler)
frustum_culler = vtk.vtkFrustumCoverageCuller()
# Add the frustum culler to the renderer
renderer.AddCuller(frustum_culler)
# Create a rendering window and add the renderer
render_window = vtk.vtkRenderWindow()
render_window.AddRenderer(renderer)
# Create a render window interactor
interactor = vtk.vtkRenderWindowInteractor()
interactor.SetRenderWindow(render_window)
# Optional: Add some geometry (e.g., a large set of spheres) to see culling effects
for i in range(10):
    sphere_source = vtk.vtkSphereSource()
    sphere_source.SetCenter(i * 2.0, 0, 0)
    mapper = vtk.vtkPolyDataMapper()
    mapper.SetInputConnection(sphere_source.GetOutputPort())
    actor = vtk.vtkActor()
    actor.SetMapper(mapper)
    renderer.AddActor(actor)
renderer.SetBackground(0.1, 0.2, 0.4)
render_window.Render()
interactor.Start()
- A vtkFrustumCoverageCuller is added to the renderer using the renderer.AddCuller(...) method to optimize rendering.
- Geometry setup involves placing multiple spheres along the x-axis; those that fall outside the camera's view frustum are culled to conserve rendering resources.
- After configuring the renderer with the culler, objects outside the frustum are automatically excluded from the rendering pipeline, enhancing performance.
Parallel Rendering and Processing
As datasets grow in size and complexity, single-threaded or single-processor visualization pipelines can become bottlenecks. To tackle this, VTK offers parallel rendering and parallel processing capabilities that harness the power of multiple CPUs, multiple GPUs, or clusters of networked machines. These methods are necessary for high-end data visualization tasks—such as astrophysical simulations, seismic data interpretation, or climate modeling—where interactivity and real-time feedback are important yet challenging to achieve.
Parallel Rendering
Parallel rendering splits the rendering workload across multiple processors or GPUs:
- Faster rendering is achieved by distributing tasks such as geometry transformation, rasterization, and compositing, which reduces the time required to render each frame.
- Scalability allows handling larger scenes or datasets by adding more GPU nodes or parallel resources to distribute the load.
- In the sort-first approach, the screen space is divided among different renderers, with each responsible for rendering a specific segment of the output image.
- In the sort-last approach, the data is divided among compute nodes, each rendering its portion, and the results are merged through compositing.
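To illustrate the sort-last idea, the sketch below composites per-process images by depth using mpi4py and NumPy. The image size, colors, and depth values are synthetic placeholders; in a real pipeline each process would supply the color and depth buffers produced by its own renderer.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each process pretends to have rendered a small RGB image plus a depth buffer
height, width = 4, 4
color = np.full((height, width, 3), rank * 40, dtype=np.uint8)
depth = np.random.rand(height, width).astype(np.float32)

# Gather every process's partial image and depth buffer on rank 0
colors = comm.gather(color, root=0)
depths = comm.gather(depth, root=0)

if rank == 0:
    # Sort-last compositing: for each pixel keep the color whose fragment
    # is closest to the camera (smallest depth value)
    final_color = colors[0].copy()
    best_depth = depths[0].copy()
    for c, d in zip(colors[1:], depths[1:]):
        closer = d < best_depth
        final_color[closer] = c[closer]
        best_depth[closer] = d[closer]
    print("Composited image shape:", final_color.shape)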
Parallel Processing
While parallel rendering focuses on visual output, parallel processing addresses data computation itself:
- Data decomposition involves splitting a dataset into chunks that can be processed independently, such as partitioning a mesh or dividing an image volume into subvolumes.
- Task decomposition refers to breaking down computation into subtasks that can execute in parallel, such as applying multiple filters to different subregions of data.
- Synchronization and communication are essential for processes to coordinate, share intermediate results, maintain consistency, and combine outputs effectively.
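As a simple illustration of these ideas (independent of VTK), the sketch below performs data decomposition by splitting a synthetic image volume into slabs and task decomposition by processing each slab in a separate worker process; the chunk count and the per-chunk statistic are arbitrary choices for demonstration.
import numpy as np
from multiprocessing import Pool

def process_chunk(chunk):
    # Stand-in for real work, e.g., filtering or feature extraction
    return float(chunk.mean())

if __name__ == "__main__":
    # Synthetic volume: 128 x 128 x 128 voxels
    volume = np.random.rand(128, 128, 128)

    # Data decomposition: split the volume into slabs along the first axis
    slabs = np.array_split(volume, 8, axis=0)

    # Task decomposition: each slab becomes an independent task
    with Pool(processes=4) as pool:
        results = pool.map(process_chunk, slabs)

    print("Per-slab means:", results)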
Parallel processing is important for:
- Large-Scale Simulations (e.g., weather modeling, computational fluid dynamics).
- Big Data Analytics (e.g., processing millions of data points or high-resolution volumetric datasets).
- Real-Time Applications (e.g., streaming sensor data in medical imaging or autonomous vehicle systems).
Example of Parallel Rendering in VTK
Below is a simplified example illustrating how you might configure VTK for parallel rendering. True high-performance parallel rendering often requires specialized hardware setups or distributed rendering servers, but this example demonstrates the core concepts:
import vtk
# Create a rendering window
render_window = vtk.vtkRenderWindow()
# Configure the window for parallel rendering (if available)
# In practice, you would need a multi-GPU system or a distributed cluster.
render_window.SetMultiSamples(0) # Disable multisampling for clarity
render_window.SetNumberOfLayers(2) # Use multiple layers for compositing in parallel
# Create a renderer and set a background color
renderer = vtk.vtkRenderer()
renderer.SetBackground(0.1, 0.1, 0.1)
render_window.AddRenderer(renderer)
# Create a render window interactor
interactor = vtk.vtkRenderWindowInteractor()
interactor.SetRenderWindow(render_window)
# Create a basic geometry (e.g., sphere) to visualize
sphere_source = vtk.vtkSphereSource()
sphere_source.SetThetaResolution(30)
sphere_source.SetPhiResolution(30)
mapper = vtk.vtkPolyDataMapper()
mapper.SetInputConnection(sphere_source.GetOutputPort())
actor = vtk.vtkActor()
actor.SetMapper(mapper)
renderer.AddActor(actor)
# Optionally configure parallel projection or camera settings for better performance
camera = renderer.GetActiveCamera()
camera.SetParallelProjection(False) # switch between perspective and parallel as needed
# Initialize and start the interaction loop
render_window.Render()
interactor.Initialize()
interactor.Start()
I. Parallel Settings: The render_window.SetNumberOfLayers(2) call hints that we want at least two rendering layers, which can be exploited in multi-GPU scenarios for compositing.
II. Basic Setup: We add a single vtkSphereSource for demonstration. In a real parallel rendering setup, each node or GPU could handle different parts of the scene or data.
III. Scalability: For complicated scenes, each GPU or node could render its portion, and VTK (or additional compositing libraries) can merge the results into a final image.
Concepts of MPI
MPI (Message Passing Interface) is a standardized, portable, and language-independent message-passing system designed to function on a wide variety of parallel computing architectures. It is widely used in high-performance computing (HPC) to enable multiple processes to coordinate and share workloads across distributed systems or multi-core architectures. This section provides an overview of the core MPI concepts and highlights how they relate to VTK (Visualization Toolkit) and parallel rendering strategies.
I. Processes
In MPI, the basic unit of computation is the process. Each MPI process has its own:
- Address space: Memory is isolated, which avoids accidental overwriting of another process’s data.
- Execution flow: Each process runs independently but can coordinate via MPI routines.
II. Communicator
A communicator is an MPI construct that specifies a group of processes that can communicate with each other. The most common communicator is MPI_COMM_WORLD, which includes all processes in the MPI job. However, you can create custom communicators for more specialized communication patterns, for example:
- Sub-communicators for subsets of processes that share certain tasks.
- Split communicators to separate processes according to specific roles or topological constraints.
III. Rank
Each process in an MPI communicator has a unique rank, an integer identifier ranging from 0 to size - 1, where size is the total number of processes in the communicator.
- Rank 0 is often called the root process in collective operations.
- The rank is used to address messages to and from other processes.
IV. Point-to-Point Communication
MPI supports direct communication between pairs of processes via point-to-point routines, enabling explicit message passing. Common functions include:
- MPI_Send: Sends a message from one process to another (blocking send).
- MPI_Recv: Receives a message from another process (blocking receive).
- There are also non-blocking equivalents (MPI_Isend, MPI_Irecv) that allow further computation while communication is in progress.
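A minimal mpi4py sketch of blocking point-to-point communication is shown below; it assumes the script is launched with at least two processes (e.g., mpirun -n 2 python script.py).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    payload = {"step": 42, "status": "ok"}
    # Blocking send of a Python object to rank 1 (analogous to MPI_Send)
    comm.send(payload, dest=1, tag=11)
elif rank == 1:
    # Blocking receive from rank 0 (analogous to MPI_Recv)
    data = comm.recv(source=0, tag=11)
    print("Rank 1 received:", data)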
V. Collective Communication
Collective communication functions involve all processes in a communicator, which is particularly useful for tasks like broadcasting, gathering, or reducing data:
- MPI_Bcast: Broadcasts a message from a root process to all other processes.
- MPI_Reduce: Collects and combines values (e.g., summation) from all processes and returns the result to a designated root.
- MPI_Gather: Gathers values from all processes into one process.
- MPI_Scatter: Distributes chunks of an array from one process to all other processes.
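The following mpi4py sketch exercises the most common collective operations; the configuration dictionary and the per-process values are illustrative placeholders.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Broadcast: rank 0 shares a configuration dictionary with everyone
config = {"resolution": 64} if rank == 0 else None
config = comm.bcast(config, root=0)

# Reduce: sum one value per process on rank 0
local_value = rank + 1
total = comm.reduce(local_value, op=MPI.SUM, root=0)

# Gather: collect every rank's value on rank 0
values = comm.gather(local_value, root=0)

if rank == 0:
    print("Broadcast config:", config)
    print("Sum of values:", total, "Gathered:", values)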
VI. Synchronization
MPI offers mechanisms for synchronizing processes:
- MPI_Barrier: All processes in the communicator wait at the barrier until every process has reached it, ensuring a consistent execution point across processes.
VII. Derived Data Types
For sending complicated data structures (e.g., mixed arrays, structs), MPI allows the creation of derived data types:
- Enables sending/receiving custom or non-contiguous data in a single MPI call.
- Reduces overhead by avoiding multiple send/receive calls for complicated data.
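A small mpi4py sketch of a derived datatype is shown below: a contiguous block of three doubles (for example, an xyz point) is committed once and then used to send several points in a single call. It assumes at least two processes and NumPy buffers; the array contents are arbitrary.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Derived datatype: one "point" = three contiguous doubles
point_type = MPI.DOUBLE.Create_contiguous(3).Commit()

points = np.zeros((5, 3), dtype=np.float64)
if rank == 0:
    points[:] = np.arange(15.0).reshape(5, 3)
    # Send five "point" elements in a single call
    comm.Send([points, 5, point_type], dest=1, tag=0)
elif rank == 1:
    comm.Recv([points, 5, point_type], source=0, tag=0)
    print("Rank 1 received points:\n", points)

point_type.Free()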
VIII. Virtual Topologies
MPI can define logical layouts or topologies (Cartesian, graph-based) for mapping processes onto specific communication patterns:
- Cartesian Topology: Common in multi-dimensional grid problems, like fluid simulation or image processing.
- Graph Topology: Useful for irregular communication patterns.
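As a brief mpi4py sketch of a Cartesian topology, the code below arranges the available processes in a 2D grid and queries each process's coordinates and neighbors; the grid dimensions are computed automatically and are purely illustrative.
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Arrange the processes in a 2D Cartesian grid
dims = MPI.Compute_dims(comm.Get_size(), [0, 0])
cart = comm.Create_cart(dims, periods=[False, False], reorder=True)

# Each process can ask for its grid coordinates and its neighbors
coords = cart.Get_coords(cart.Get_rank())
# Shift along the second grid dimension returns (source, destination) ranks
left, right = cart.Shift(direction=1, disp=1)
print(f"Rank {cart.Get_rank()} at coords {coords}, neighbors along dim 1: {left}, {right}")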
IX. Error Handling
MPI includes error-handling mechanisms to manage or ignore errors gracefully:
- Default behavior typically aborts the entire MPI program.
- Custom error handlers allow for more sophisticated error recovery strategies.
A Simple mpi4py Example in Python
While MPI is available for C, C++, and Fortran, Python developers often use mpi4py, a Pythonic interface to MPI. Here is a minimal example illustrating basic MPI usage:
from mpi4py import MPI
# mpi4py initializes the MPI environment automatically when mpi4py.MPI is
# imported, so no explicit MPI.Init() call is needed
# Obtain the global communicator
comm = MPI.COMM_WORLD
# Get the rank (ID) of the current process
rank = comm.Get_rank()
# Get the total number of processes
size = comm.Get_size()
# Print a simple message from each process
print(f"Hello from process {rank} of {size}")
# mpi4py also finalizes the MPI environment automatically at interpreter exit,
# so no explicit MPI.Finalize() call is needed
- Importing mpi4py.MPI sets up the MPI environment automatically; calling MPI.Init() again would fail because MPI is already initialized.
- MPI.COMM_WORLD is the default communicator containing all processes.
- comm.Get_rank() and comm.Get_size() allow you to identify each process and the total process count.
- At interpreter exit, mpi4py finalizes MPI, ensuring that any outstanding communications complete and MPI resources are released.
Primary Classes in VTK for Parallelism
When integrating MPI with VTK to tackle large-scale visualization problems, two necessary classes often come into play:
I. vtkParallelRenderManager
- Used to manage and coordinate parallel rendering.
- Its purpose is to distribute rendering tasks among multiple processes and composite partial results into a final image.
- It is commonly used in large-scale scientific visualization, such as computational fluid dynamics (CFD), molecular dynamics, or volume rendering.
- The rendering tasks are divided across processes or GPUs to handle large datasets efficiently.
- Intermediate images or geometry are collected from each process after their local rendering tasks.
- Partial results are composited into a final scene that is displayed or saved as an image.
Here is a minimal example of setting up a vtkParallelRenderManager:
import vtk
# Create a render window
renderWindow = vtk.vtkRenderWindow()
# Instantiate the parallel render manager and link it to the window.
# vtkParallelRenderManager is an abstract base class; vtkCompositeRenderManager
# is a commonly used concrete implementation (requires VTK built with MPI support)
renderManager = vtk.vtkCompositeRenderManager()
renderManager.SetRenderWindow(renderWindow)
# Initialize MPI controller
controller = vtk.vtkMPIController()
controller.Initialize()
renderManager.SetController(controller)
# Create a renderer and add to the render window
renderer = vtk.vtkRenderer()
renderWindow.AddRenderer(renderer)
# (Optional) Add some actors, e.g., a simple sphere
sphereSource = vtk.vtkSphereSource()
mapper = vtk.vtkPolyDataMapper()
mapper.SetInputConnection(sphereSource.GetOutputPort())
actor = vtk.vtkActor()
actor.SetMapper(mapper)
renderer.AddActor(actor)
# Render the scene in parallel
renderWindow.Render()
Note: In a real parallel environment (e.g., an HPC cluster), each process runs an instance of this code. The vtkMPIController and vtkParallelRenderManager coordinate tasks among them.
II. vtkMPIController
- Serves as an interface between VTK and the MPI communication layer.
- It manages the initialization and finalization of MPI to set up and terminate communication.
- Point-to-point communication is used for exchanging data directly between processes.
- Collective communication handles operations like broadcasts, gathers, or reductions across multiple processes.
- It facilitates data distribution or synchronization tasks in parallel processing environments.
- The controller coordinates with the parallel rendering manager to ensure consistent rendering across processes.
Below is a simplified structure of a VTK application that employs MPI:
import vtk
# Create and set up the MPI controller
# (vtkMPIController initializes MPI itself, so an explicit mpi4py MPI.Init()
# call is not needed here)
controller = vtk.vtkMPIController()
controller.Initialize()
# Create a render window and associated renderer
renderWindow = vtk.vtkRenderWindow()
renderer = vtk.vtkRenderer()
renderWindow.AddRenderer(renderer)
# Create and set up the parallel render manager
# (vtkCompositeRenderManager is a concrete subclass of vtkParallelRenderManager)
renderManager = vtk.vtkCompositeRenderManager()
renderManager.SetRenderWindow(renderWindow)
renderManager.SetController(controller)
# Add some geometry to render
coneSource = vtk.vtkConeSource()
coneSource.SetResolution(30)
mapper = vtk.vtkPolyDataMapper()
mapper.SetInputConnection(coneSource.GetOutputPort())
actor = vtk.vtkActor()
actor.SetMapper(mapper)
renderer.AddActor(actor)
# Perform the parallel render
renderWindow.Render()
# Finalize MPI through the controller
controller.Finalize()
By using vtkMPIController, each MPI process can coordinate how data is partitioned, communicated, and combined into the final visualization.
Contextual Configuration and Use Cases
Whether or not you need MPI-based parallel rendering in VTK depends on your deployment context:
| Context | Configuration |
| --- | --- |
| Single Processor (Local Machine) | Standard VTK rendering (no MPI needed). Sufficient for small datasets or interactive demos on a single workstation. |
| Multi-core Machine (Shared Memory) | MPI can still be used across cores, but shared-memory parallelism (threading with TBB, OpenMP, or Python multiprocessing) often suffices. For truly large data, MPI + vtkParallelRenderManager can be beneficial. |
| Distributed System (Cluster/HPC) | Full MPI usage is required to span multiple nodes: vtkMPIController for communication and vtkParallelRenderManager for distributing and compositing the final image. Data partitioning and load balancing must be handled explicitly. |
Practical Considerations
When scaling your application to multiple nodes or a large number of processes, pay attention to:
Load Balancing
- Static load balancing assigns tasks or data subsets to processes before execution, ensuring each node gets a predetermined workload (e.g., a simple block-based decomposition of a mesh). This approach is straightforward to carry out and works well when the workload is uniformly distributed or predictable. However, it might lead to underutilized nodes if the actual workload deviates significantly from the initial estimate.
- Dynamic load balancing continuously or periodically monitors each node’s workload during runtime and redistributes tasks if some nodes finish earlier or encounter less complicated geometry. This can help maintain balanced use of resources across the cluster, improving overall throughput. However, dynamic approaches often introduce additional overhead for monitoring and redistributing work, so they should be used judiciously, especially when the cost of redistribution may outweigh its benefits.
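The sketch below contrasts the two approaches on a single machine: a static block assignment computed up front, and a dynamic scheme in which workers pull tasks from a pool as they become free. The task list and worker counts are arbitrary illustrative values.
from multiprocessing import Pool

def static_block(tasks, num_procs, rank):
    # Static load balancing: each rank takes a fixed, contiguous block of tasks
    per_proc = (len(tasks) + num_procs - 1) // num_procs
    return tasks[rank * per_proc:(rank + 1) * per_proc]

def do_work(task):
    return task * task  # stand-in for work of varying cost

if __name__ == "__main__":
    tasks = list(range(16))

    # Static assignment: decided before execution
    print("Rank 2 of 4 would get:", static_block(tasks, 4, 2))

    # Dynamic assignment: imap_unordered hands out tasks as workers finish
    with Pool(processes=4) as pool:
        results = list(pool.imap_unordered(do_work, tasks))
    print("Dynamically processed results:", sorted(results))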
Data Distribution
- Partitioning involves dividing the dataset (e.g., a volume or mesh) into smaller logical or spatial subsets. The goal is to reduce the amount of inter-process communication needed while still allowing each node to work efficiently on its portion of the data. Careful partitioning can significantly improve scalability, but poor partition strategies can lead to load imbalance or increased network traffic.
- Replication stores the entire dataset on each node, drastically reducing communication overhead since all the necessary data is local. However, this method is memory-intensive, making it feasible only for smaller datasets or systems with very large memory capacities.
- Decomposition provides a compromise by distributing only subsets of the dataset (e.g., sub-volumes, partial meshes) to each node. This method optimizes memory usage and potentially balances computational loads, but requires effective communication mechanisms when data or dependencies span multiple nodes. If not managed properly, communication can become a bottleneck and offset the benefits of decomposition.
- Data locality focuses on placing data physically close to the nodes that process it. This can be achieved through careful assignment of tasks to nodes with corresponding data or by employing hardware features like non-uniform memory access (NUMA) controls. Proper data locality helps reduce network latency and improves performance, especially for data-intensive computations.
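As a VTK-flavored sketch of partitioning, the code below splits a synthetic image volume into slabs along the z-axis with vtkExtractVOI; in a distributed setting, each rank would extract and process only its own slab. The volume extent and the number of pieces are illustrative assumptions.
import vtk

# Synthetic image volume (the standard VTK "wavelet" source)
source = vtk.vtkRTAnalyticSource()
source.SetWholeExtent(0, 63, 0, 63, 0, 63)
source.Update()

num_pieces = 4
slab_size = 64 // num_pieces

for piece in range(num_pieces):
    z_start = piece * slab_size
    z_end = z_start + slab_size - 1

    # Extract only this piece's sub-volume
    extractor = vtk.vtkExtractVOI()
    extractor.SetInputConnection(source.GetOutputPort())
    extractor.SetVOI(0, 63, 0, 63, z_start, z_end)
    extractor.Update()

    sub_volume = extractor.GetOutput()
    print(f"Piece {piece}: z extent {z_start}-{z_end}, "
          f"{sub_volume.GetNumberOfPoints()} points")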
Synchronization
- Barriers (e.g., MPI_Barrier) make sure that all processes reach a certain point in the code before proceeding. This can be useful for maintaining consistent states across processes, such as at important algorithm phases or before collective I/O. Overusing barriers, however, can degrade performance by forcing faster processes to wait for slower ones.
- Collective Operations (e.g., MPI_Bcast, MPI_Reduce) provide coordinated and efficient ways to share or aggregate data among all processes. These operations are heavily optimized in most MPI implementations, but must be used carefully to avoid excessive synchronization overhead.
- Race Conditions occur when two or more processes attempt to modify shared data structures simultaneously in an uncontrolled manner. To avoid this, ensure proper synchronization through locking mechanisms, atomic operations, or careful data partitioning that prevents unwanted concurrent writes.
Error Handling
- Default Error Handler in MPI typically aborts the entire job if an error occurs in any process. This ensures fast failure recognition but offers no opportunity for partial recovery or logging intermediate results.
- Custom Error Handlers can be set to catch and handle errors more gracefully. They allow you to log the error, attempt partial computations, or even dynamically reassign tasks if a process fails. This can be important in large-scale systems where maintaining some level of progress is preferable to aborting the entire job due to a single failure.
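A short mpi4py sketch of a non-default error handler is shown below; the deliberately invalid destination rank is only there to trigger an error that can be caught instead of aborting the job.
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Replace the default "abort on error" behavior with error codes
# that the application can inspect and handle
comm.Set_errhandler(MPI.ERRORS_RETURN)

try:
    # Deliberately invalid: send to a rank that does not exist
    comm.send("data", dest=comm.Get_size() + 10, tag=0)
except MPI.Exception as exc:
    print(f"Rank {comm.Get_rank()} caught an MPI error:", exc)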
Detailed Example with Tasks, Distribution, and Rendering
Below is a more in-depth illustrative example combining MPI for both task distribution and VTK parallel rendering:
import vtk
import time
# ----------------------------------------
# 1. MPI Initialization
# ----------------------------------------
# Create the MPI controller and initialize it.
# vtkMPIController initializes MPI itself, so an explicit MPI.Init() via
# mpi4py is not needed (and mixing the two can lead to double initialization).
controller = vtk.vtkMPIController()
controller.Initialize()
# Obtain local rank (process ID) and total number of processes
rank = controller.GetLocalProcessId()
num_procs = controller.GetNumberOfProcesses()
# ----------------------------------------
# 2. Setup Render Window & Parallel Manager
# ----------------------------------------
render_window = vtk.vtkRenderWindow()
# Create a renderer and add it to the window
renderer = vtk.vtkRenderer()
render_window.AddRenderer(renderer)
# Instantiate the parallel render manager
# (vtkCompositeRenderManager is a concrete subclass of vtkParallelRenderManager)
render_manager = vtk.vtkCompositeRenderManager()
render_manager.SetRenderWindow(render_window)
render_manager.SetController(controller)
# ----------------------------------------
# 3. Example Task Distribution
# ----------------------------------------
# Let the root (rank 0) process define the full list of tasks
tasks = list(range(16)) if rank == 0 else []  # Example: 16 tasks in total

# Broadcast the number of tasks per process.
# vtkMultiProcessController exchanges VTK arrays, so the value is wrapped
# in a one-element vtkIntArray before broadcasting.
count_array = vtk.vtkIntArray()
count_array.SetNumberOfTuples(1)
if rank == 0:
    count_array.SetValue(0, len(tasks) // num_procs)
controller.Broadcast(count_array, 0)
tasks_per_proc = count_array.GetValue(0)

# Rank 0 sends each process its slice of tasks as a vtkIntArray
local_array = vtk.vtkIntArray()
if rank == 0:
    for proc_id in range(num_procs):
        start_idx = proc_id * tasks_per_proc
        sub_tasks = tasks[start_idx:start_idx + tasks_per_proc]
        chunk = vtk.vtkIntArray()
        chunk.SetNumberOfTuples(len(sub_tasks))
        for i, task in enumerate(sub_tasks):
            chunk.SetValue(i, task)
        if proc_id == 0:
            local_array = chunk
        else:
            controller.Send(chunk, proc_id, 1234)
else:
    controller.Receive(local_array, 0, 1234)

local_tasks = [int(local_array.GetValue(i))
               for i in range(local_array.GetNumberOfTuples())]
print(f"[Rank {rank}] has tasks: {local_tasks}")
# Simulate doing work on the local tasks
for task in local_tasks:
    time.sleep(0.1)  # Example: replace with real computation
# Synchronize all processes
controller.Barrier()
# ----------------------------------------
# 4. Data Distribution & Rendering Setup
# ----------------------------------------
# Each process creates a sphere with rank-dependent resolution and position
sphere_source = vtk.vtkSphereSource()
sphere_source.SetCenter(rank * 2.0, 0, 0) # Offset each sphere
sphere_source.SetRadius(0.5)
sphere_source.SetThetaResolution(8 + rank * 2)
sphere_source.SetPhiResolution(8 + rank * 2)
# Build mapper & actor for this local piece
mapper = vtk.vtkPolyDataMapper()
mapper.SetInputConnection(sphere_source.GetOutputPort())
actor = vtk.vtkActor()
actor.SetMapper(mapper)
# Add the local actor to the renderer
renderer.AddActor(actor)
# Optionally, rank 0 sets the camera
if rank == 0:
    camera = renderer.GetActiveCamera()
    camera.SetPosition(0, 0, 20)
    camera.SetFocalPoint(0, 0, 0)
# Synchronize all processes before rendering
controller.Barrier()
# ----------------------------------------
# 5. Parallel Rendering
# ----------------------------------------
render_window.Render()
# Optionally, save screenshots on rank 0
if rank == 0:
    w2i = vtk.vtkWindowToImageFilter()
    w2i.SetInput(render_window)
    w2i.Update()
    writer = vtk.vtkPNGWriter()
    writer.SetFileName("parallel_render_output.png")
    writer.SetInputConnection(w2i.GetOutputPort())
    writer.Write()
    print("[Rank 0] Saved parallel_render_output.png")
# Final synchronization before exit
controller.Barrier()
# ----------------------------------------
# 6. MPI Finalization
# ----------------------------------------
controller.Finalize()
- MPI and VTK setup is handled through a vtkMPIController, which initializes MPI and manages communication between the parallel tasks.
- Rank 0 subdivides a list of tasks and distributes a subset to each rank, broadcasting the chunk size and sending each slice as a vtkIntArray.
- Each process performs local computations, simulating tasks such as data processing, simulation steps, or geometry generation.
- Data partitioning is demonstrated as each rank creates a sphere offset along the x-axis to visualize parallel rendering.
- The parallel render manager ensures proper display coordination of all actors from different processes.
- Rank 0 optionally saves a screenshot of the rendered scene after the parallel render completes.
- Synchronization between processes is achieved using controller.Barrier() to ensure all ranks proceed together at critical points.
- MPI is finalized through the controller after rendering to gracefully terminate the parallel environment.