Low-Level CUDA Support

Device management

class cupy.cuda.Device

Object that represents a CUDA device.

This class provides some basic manipulations on CUDA devices.

It supports the context protocol. For example, the following code temporarily switches the current device:

with Device(0):
    do_something_on_device_0()

When the with statement exits, the current device is reset to the original one.
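The save-and-restore behaviour described above can be modelled without a GPU. The following is a minimal pure-Python sketch, not the CuPy implementation: ToyDevice and its current_id attribute are hypothetical names introduced only to illustrate how a context-protocol object can restore the previous device on exit.

```python
class ToyDevice:
    """Pure-Python stand-in that mimics the documented context behaviour."""

    current_id = 0  # ID of the "current device", shared by all instances

    def __init__(self, device_id=None):
        # Like the documented default, fall back to the current device.
        self.id = ToyDevice.current_id if device_id is None else int(device_id)
        self._saved = []  # stack of device IDs to restore on exit

    def __int__(self):
        return self.id

    def use(self):
        """Make this device current (no automatic restore)."""
        ToyDevice.current_id = self.id

    def __enter__(self):
        self._saved.append(ToyDevice.current_id)
        self.use()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Restore whichever device was current before the with block.
        ToyDevice.current_id = self._saved.pop()
        return False  # never swallow exceptions


with ToyDevice(1):
    inside = ToyDevice.current_id  # device 1 is current inside the block
after = ToyDevice.current_id       # the original device is current again
```

Keeping a stack of saved IDs, rather than a single value, is one way to make nested with blocks on the same object unwind correctly.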

Parameters:device (int or cupy.cuda.Device) – Index of the device to manipulate. Note that device IDs (a.k.a. GPU IDs) are zero-based. If a Device object is given, its ID is used. The current device is selected by default.
Variables:id (int) – ID of this device.
__eq__

x.__eq__(y) <==> x==y

__ge__

x.__ge__(y) <==> x>=y

__gt__

x.__gt__(y) <==> x>y

__int__
__le__

x.__le__(y) <==> x<=y

__long__
__lt__

x.__lt__(y) <==> x<y

__ne__

x.__ne__(y) <==> x!=y

__repr__
compute_capability

Compute capability of this device.

The capability is represented by a string containing the major index and the minor index. For example, compute capability 3.5 is represented by the string ‘35’.
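Because the capability is a string rather than a number, code that compares capabilities usually needs to split it first. The helper below is a small sketch; split_compute_capability is a hypothetical name, and it assumes the minor index is always the last digit of the string.

```python
def split_compute_capability(capability):
    """Split a capability string such as '35' into (major, minor) integers."""
    if not capability.isdigit() or len(capability) < 2:
        raise ValueError("unexpected capability string: %r" % (capability,))
    # Assumption: the last digit is the minor index, the rest is the major.
    return int(capability[:-1]), int(capability[-1])


major, minor = split_compute_capability("35")
```

Comparing the resulting tuples, for example (major, minor) >= (3, 5), avoids the pitfalls of comparing the strings lexicographically.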

cublas_handle

The cuBLAS handle for this device.

The same handle is used for the same device even if the Device instance itself is different.

cusolver_handle

The cuSOLVER handle for this device.

The same handle is used for the same device even if the Device instance itself is different.

synchronize()

Synchronizes the current thread to the device.

use()

Makes this device current.

If you want to switch a device temporarily, use the with statement.

Memory management

class cupy.cuda.Memory

Memory allocation on a CUDA device.

This class provides an RAII interface for CUDA memory allocation.

Parameters:
  • device (cupy.cuda.Device) – Device whose memory the pointer refers to.
  • size (int) – Size of the memory allocation in bytes.
__int__

Returns the pointer value to the head of the allocation.

__long__
class cupy.cuda.MemoryPointer

Pointer to a location in device memory.

An instance of this class holds a reference to the original memory buffer and a pointer to a place within this buffer.

Parameters:
  • mem (Memory) – The device memory buffer.
  • offset (int) – An offset from the head of the buffer to the place this pointer refers to.
Variables:
  • device (cupy.cuda.Device) – Device whose memory the pointer refers to.
  • mem (Memory) – The device memory buffer.
  • ptr (int) – Pointer to the place within the buffer.
__add__

Adds an offset to the pointer.

__iadd__

Adds an offset to the pointer in place.

__int__

Returns the pointer value.

__isub__

Subtracts an offset from the pointer in place.

__long__
__radd__

x.__radd__(y) <==> y+x

__rsub__

x.__rsub__(y) <==> y-x

__sub__

Subtracts an offset from the pointer.

copy_from()

Copies a memory sequence from a (possibly different) device or host.

This function is a convenience interface that selects the appropriate one of copy_from_device() and copy_from_host().

Parameters:
copy_from_async()

Copies a memory sequence from an arbitrary place asynchronously.

This function is a convenience interface that selects the appropriate one of copy_from_device_async() and copy_from_host_async().

Parameters:
copy_from_device()

Copies a memory sequence from a (possibly different) device.

Parameters:
copy_from_device_async()

Copies a memory sequence from a (possibly different) device asynchronously.

Parameters:
copy_from_host()

Copies a memory sequence from the host memory.

Parameters:
  • mem (ctypes.c_void_p) – Source memory pointer.
  • size (int) – Size of the sequence in bytes.
copy_from_host_async()

Copies a memory sequence from the host memory asynchronously.

Parameters:
copy_to_host()

Copies a memory sequence to the host memory.

Parameters:
  • mem (ctypes.c_void_p) – Target memory pointer.
  • size (int) – Size of the sequence in bytes.
copy_to_host_async()

Copies a memory sequence to the host memory asynchronously.

Parameters:
memset()

Fills a memory sequence with a constant byte value.

Parameters:
  • value (int) – Value to fill.
  • size (int) – Size of the sequence in bytes.
memset_async()

Fills a memory sequence with a constant byte value asynchronously.

Parameters:
  • value (int) – Value to fill.
  • size (int) – Size of the sequence in bytes.
  • stream (cupy.cuda.Stream) – CUDA stream.
cupy.cuda.alloc()

Calls the current allocator.

Use set_allocator() to change the current allocator.

Parameters:size (int) – Size of the memory allocation.
Returns:Pointer to the allocated buffer.
Return type:MemoryPointer
cupy.cuda.set_allocator()

Sets the current allocator.

Parameters:allocator (function) – CuPy memory allocator. It must have the same interface as the cupy.cuda.alloc() function, which takes the buffer size as an argument and returns the device buffer of that size.
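Since an allocator is just a function from a size to a buffer, it can be wrapped like any other callable. The sketch below shows one possible wrapper that records every requested size; make_logging_allocator is a hypothetical helper, and the demonstration uses bytearray as a stand-in for a real device allocator.

```python
def make_logging_allocator(base_allocator, log):
    """Wrap an allocator so that every requested size is appended to log."""
    def allocator(size):
        log.append(size)
        return base_allocator(size)
    return allocator


# Stand-in demonstration: bytearray plays the role of the base allocator.
sizes = []
toy_alloc = make_logging_allocator(bytearray, sizes)
buffer = toy_alloc(16)
```

With CuPy available, the same wrapper could presumably be installed with cupy.cuda.set_allocator(make_logging_allocator(cupy.cuda.MemoryPool().malloc, sizes)), since MemoryPool.malloc has the required interface.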
class cupy.cuda.MemoryPool

Memory pool for all devices on the machine.

A memory pool preserves any allocations even if they are freed by the user. Freed memory buffers are held by the memory pool as free blocks, and they are reused for further memory allocations of the same size. The allocated blocks are managed for each device, so one instance of this class can be used for multiple devices.

Note

When the allocation is skipped by reusing a pre-allocated block, cudaMalloc is not called and therefore no CPU-GPU synchronization occurs. This makes interleaved memory allocations and kernel invocations very fast.

Note

The memory pool holds allocated blocks for as long as possible without freeing them. As a result, the program may hold most of the device memory, which can cause other CUDA programs running in parallel to run out of memory.

Parameters:allocator (function) – The base CuPy memory allocator. It is used for allocating new blocks when the blocks of the required size are all in use.
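The reuse policy described above (free blocks are kept and handed out again for requests of the same size) can be sketched in a few lines of pure Python. This is an illustrative model, not the CuPy implementation: ToyPool and its explicit free() method are hypothetical, per-device bookkeeping is omitted, and bytearray stands in for device memory.

```python
from collections import defaultdict


class ToyPool:
    """Size-keyed free-list pool; a simplified model of the idea."""

    def __init__(self, allocator):
        self._allocator = allocator     # base allocator for new blocks
        self._free = defaultdict(list)  # block size -> list of free blocks

    def malloc(self, size):
        # Reuse a free block of exactly this size if one exists.
        if self._free[size]:
            return self._free[size].pop()
        return self._allocator(size)

    def free(self, block):
        # Keep the block for later reuse instead of releasing it.
        self._free[len(block)].append(block)

    def n_free_blocks(self):
        return sum(len(blocks) for blocks in self._free.values())

    def free_all_blocks(self):
        # Drop every cached block so the memory can actually be released.
        self._free.clear()


pool = ToyPool(bytearray)
first = pool.malloc(64)
pool.free(first)
second = pool.malloc(64)  # served from the pool; no new allocation
```

In this model the second request returns the very same object as the first, which is the effect that lets a real pool skip the underlying allocation call.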
free_all_blocks()

Release free blocks.

free_all_free()

Release free blocks.

malloc()

Allocates the memory, from the pool if possible.

This method can be used as a CuPy memory allocator. The simplest way to use a memory pool as the default allocator is the following code:

set_allocator(MemoryPool().malloc)
Parameters:size (int) – Size of the memory buffer to allocate in bytes.
Returns:Pointer to the allocated buffer.
Return type:MemoryPointer
n_free_blocks()

Count the total number of free blocks.

Returns:The total number of free blocks.
Return type:int

Streams and events

class cupy.cuda.Stream(null=False, non_blocking=False)

CUDA stream.

This class handles the CUDA stream handle in an RAII way, i.e., when a Stream instance is destroyed by the GC, its handle is also destroyed.

Parameters:
  • null (bool) – If True, the stream is a null stream (i.e. the default stream that synchronizes with all streams). Otherwise, a plain new stream is created.
  • non_blocking (bool) – If True, the stream does not synchronize with the NULL stream.
Variables:

ptr (cupy.cuda.runtime.Stream) – Raw stream handle. It can be passed to the CUDA Runtime API via ctypes.

add_callback(callback, arg)

Adds a callback that is called when all queued work is done.

Parameters:
  • callback (function) – Callback function. It must take three arguments (Stream object, int error status, and user data object) and return nothing.
  • arg (object) – Argument to the callback.
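A callback with the three-argument shape described above might look like the following sketch. make_status_recorder is a hypothetical helper; the only outside fact relied on is that a CUDA error status of 0 means success.

```python
def make_status_recorder(results):
    """Build a callback that stores (status, user_data) pairs in results."""
    def callback(stream, status, user_data):
        # status is the integer CUDA error code; 0 means success.
        results.append((status, user_data))
        # Implicitly returns None, as the callback contract requires.
    return callback


results = []
on_done = make_status_recorder(results)
```

Given a Stream object named stream, the callback would then be registered with stream.add_callback(on_done, "tag"), where "tag" is passed through as the user data object.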
done

True if all work on this stream has been done.

record(event=None)

Records an event on the stream.

Parameters:event (None or cupy.cuda.Event) – CUDA event. If None, then a new plain event is created and used.
Returns:The recorded event.
Return type:cupy.cuda.Event
synchronize()

Waits for the stream to complete all queued work.

wait_event(event)

Makes the stream wait for an event.

The future work on this stream will be done after the event.

Parameters:event (cupy.cuda.Event) – CUDA event.
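Combining record() and wait_event() gives a simple way to order work across two streams. The helper below is a sketch under the assumption that both arguments behave like the Stream class documented here; order_after is a hypothetical name.

```python
def order_after(producer, consumer):
    """Make future work on consumer wait for work already queued on producer."""
    # With no argument, record() creates a new event and returns it.
    event = producer.record()
    # Work queued on consumer after this call runs only after the event.
    consumer.wait_event(event)
    return event
```

Only work queued on the consumer stream after the call is affected; work that was already queued on it is not delayed by the event.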
class cupy.cuda.Event(block=False, disable_timing=False, interprocess=False)

CUDA event, a synchronization point of CUDA streams.

This class handles the CUDA event handle in an RAII way, i.e., when an Event instance is destroyed by the GC, its handle is also destroyed.

Parameters:
  • block (bool) – If True, the event blocks on the synchronize() method.
  • disable_timing (bool) – If True, the event does not prepare the timing data.
  • interprocess (bool) – If True, the event can be passed to other processes.
Variables:

ptr (cupy.cuda.runtime.Stream) – Raw event handle. It can be passed to the CUDA Runtime API via ctypes.

done

True if the event is done.

record(stream=None)

Records the event to a stream.

Parameters:stream (cupy.cuda.Stream) – CUDA stream on which to record the event. The null stream is used by default.
synchronize()

Synchronizes all device work to the event.

If the event is created as a blocking event, it also blocks the CPU thread until the event is done.

cupy.cuda.get_elapsed_time(start_event, end_event)

Gets the elapsed time between two events.

Parameters:
  • start_event (Event) – Earlier event.
  • end_event (Event) – Later event.
Returns:

Elapsed time in milliseconds.

Return type:

float
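A common way to use this function is to bracket a piece of work with two events. The following sketch assumes that the cuda argument is the cupy.cuda module (it is passed in explicitly so the helper has no import-time dependency); elapsed_ms is a hypothetical name.

```python
def elapsed_ms(work, cuda):
    """Return the time in milliseconds that work() takes, measured with events.

    cuda is expected to be the cupy.cuda module, passed in by the caller.
    """
    start = cuda.Event()  # timing must stay enabled (disable_timing=False)
    end = cuda.Event()
    start.record()        # recorded on the null stream by default
    work()                # queue the GPU work to be measured
    end.record()
    end.synchronize()     # wait until the end event has completed
    return cuda.get_elapsed_time(start, end)
```

With CuPy installed, a call could look like elapsed_ms(lambda: do_something(), cupy.cuda), where do_something is a placeholder for the work being timed.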

Profiler

cupy.cuda.profile(*args, **kwds)

Enables CUDA profiling during a with statement.

This function enables profiling on entering a with statement, and disables profiling on leaving the statement.

>>> with cupy.cuda.profile():
...    # do something you want to measure
...    pass
cupy.cuda.profiler.initialize()

Initialize the CUDA profiler.

This function initializes the CUDA profiler. See the CUDA documentation for details.

Parameters:
  • config_file (str) – Name of the configuration file.
  • output_file (str) – Name of the output file.
  • output_mode (int) – cupy.cuda.profiler.cudaKeyValuePair or cupy.cuda.profiler.cudaCSV.
cupy.cuda.profiler.start()

Enable profiling.

A user can enable CUDA profiling. When an error occurs, it raises an exception.

See the CUDA documentation for details.

cupy.cuda.profiler.stop()

Disable profiling.

A user can disable CUDA profiling. When an error occurs, it raises an exception.

See the CUDA documentation for details.

cupy.cuda.nvtx.Mark()

Marks an instantaneous event (marker) in the application.

Markers are used to describe events at a specific time during execution of the application.

Parameters:
  • message (str) – Name of a marker.
  • id_color (int) – ID of color for a marker.
cupy.cuda.nvtx.MarkC()

Marks an instantaneous event (marker) in the application.

Markers are used to describe events at a specific time during execution of the application.

Parameters:
  • message (str) – Name of a marker.
  • color (uint32) – Color code for a marker.
cupy.cuda.nvtx.RangePush()

Starts a nested range.

Ranges are used to describe events over a time span during execution of the application. The duration of a range is defined by the corresponding pair of RangePush*() and RangePop() calls.

Parameters:
  • message (str) – Name of a range.
  • id_color (int) – ID of color for a range.
cupy.cuda.nvtx.RangePushC()

Starts a nested range.

Ranges are used to describe events over a time span during execution of the application. The duration of a range is defined by the corresponding pair of RangePush*() and RangePop() calls.

Parameters:
  • message (str) – Name of a range.
  • color (uint32) – ARGB color for a range.
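An ARGB value can be built from its four channels with a few shifts. This sketch assumes the usual layout of 8 bits per channel with alpha in the most significant byte; argb is a hypothetical helper name.

```python
def argb(alpha, red, green, blue):
    """Pack four 8-bit channels into one 32-bit ARGB integer."""
    for channel in (alpha, red, green, blue):
        if not 0 <= channel <= 0xFF:
            raise ValueError("channel out of range: %r" % (channel,))
    # Assumed layout: alpha in bits 24-31, then red, green, blue.
    return (alpha << 24) | (red << 16) | (green << 8) | blue


opaque_green = argb(0xFF, 0x00, 0xFF, 0x00)
```

The resulting integer could then be passed as the color argument, for example RangePushC("step", opaque_green).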
cupy.cuda.nvtx.RangePop()

Ends a nested range.

Ranges are used to describe events over a time span during execution of the application. The duration of a range is defined by the corresponding pair of RangePush*() and RangePop() calls.
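Because every RangePush*() needs a matching RangePop(), it can be convenient to wrap the pair in a context manager so the range is closed even when an exception is raised. The sketch below assumes that the nvtx argument is the cupy.cuda.nvtx module (passed in explicitly); nvtx_range and its default id_color of 0 are hypothetical choices, not part of the API described here.

```python
from contextlib import contextmanager


@contextmanager
def nvtx_range(nvtx, message, id_color=0):
    """Open a nested range on entry and always close it on exit."""
    nvtx.RangePush(message, id_color)
    try:
        yield
    finally:
        # Runs on normal exit and on exceptions, keeping pushes balanced.
        nvtx.RangePop()
```

Usage would look like: with nvtx_range(cupy.cuda.nvtx, "forward pass"): followed by the code to annotate.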