Low-Level CUDA Support¶

Device management¶

class cupy.cuda.Device¶

Object that represents a CUDA device.

This class provides some basic manipulations on CUDA devices.

It supports the context protocol. For example, the following code temporarily switches the current device:

with Device(0):
    do_something_on_device_0()

When the with statement exits, the current device is reset to the original one.

Parameters:	device (int or cupy.cuda.Device) – Index of the device to manipulate. Note that device IDs (a.k.a. GPU IDs) are zero-based. If a Device object is given, its ID is used. The current device is selected by default.
Variables:	id (int) – ID of this device.
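The save-and-restore behavior of the context protocol described above can be modeled in pure Python. The following is a minimal sketch that runs without a GPU; `ToyDevice` and its `current_id` attribute are hypothetical stand-ins for illustration and are not how cupy.cuda.Device is implemented.

```python
class ToyDevice:
    """Toy stand-in that mimics the documented behavior; NOT cupy.cuda.Device."""

    current_id = 0  # hypothetical global "current device" ID

    def __init__(self, device=None):
        # Accept an int or another ToyDevice; default to the current device.
        self.id = ToyDevice.current_id if device is None else int(device)
        self._saved = []  # IDs to restore, one per active with block

    def __int__(self):
        return self.id

    def use(self):
        """Make this device current (no automatic restore)."""
        ToyDevice.current_id = self.id

    def __enter__(self):
        self._saved.append(ToyDevice.current_id)
        self.use()
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        ToyDevice.current_id = self._saved.pop()
        return False  # do not swallow exceptions


with ToyDevice(1):
    inside = ToyDevice.current_id      # device 1 is current here
    with ToyDevice(ToyDevice(2)):      # a device object is accepted too
        nested = ToyDevice.current_id  # device 2 is current here
after = ToyDevice.current_id           # restored to the original device
```

The sketch also shows the difference between use(), which switches permanently, and the with statement, which restores the previous device on exit.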

__eq__¶: x.__eq__(y) <==> x==y

__ge__¶: x.__ge__(y) <==> x>=y

__gt__¶: x.__gt__(y) <==> x>y

__int__¶

__le__¶: x.__le__(y) <==> x<=y

__long__¶

__lt__¶: x.__lt__(y) <==> x<y

__ne__¶: x.__ne__(y) <==> x!=y

__repr__¶

compute_capability¶

Compute capability of this device.

The capability is represented by a string containing the major index and the minor index. For example, compute capability 3.5 is represented by the string ‘35’.
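Because the capability is a string rather than a number, code that compares capabilities usually needs to split it first. A small sketch of such a helper follows; `parse_compute_capability` is a hypothetical name, and the code assumes the minor index is always the last single digit.

```python
def parse_compute_capability(cc):
    """Split a capability string such as '35' into (major, minor).

    Assumption: the minor index is the last digit and everything
    before it is the major index.
    """
    if len(cc) < 2 or not cc.isdigit():
        raise ValueError("unexpected compute capability: %r" % (cc,))
    return int(cc[:-1]), int(cc[-1])


major, minor = parse_compute_capability('35')  # compute capability 3.5
```

Tuples compare element by element, so `parse_compute_capability(cc) >= (3, 5)` is a safer check than comparing the raw strings.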

cublas_handle¶

The cuBLAS handle for this device.

The same handle is used for the same device even if the Device instance itself is different.

cusolver_handle¶

The cuSOLVER handle for this device.

The same handle is used for the same device even if the Device instance itself is different.

synchronize()¶: Synchronizes the current thread to the device.

use()¶

Makes this device current.

If you want to switch a device temporarily, use the with statement.

Memory management¶

class cupy.cuda.Memory¶

Memory allocation on a CUDA device.

This class provides an RAII interface to CUDA memory allocation.

Parameters:	device (cupy.cuda.Device) – Device whose memory the pointer refers to. size (int) – Size of the memory allocation in bytes.

__int__¶: Returns the pointer value to the head of the allocation.

__long__¶

class cupy.cuda.MemoryPointer¶

Pointer to a location in device memory.

An instance of this class holds a reference to the original memory buffer and a pointer to a place within this buffer.

Parameters:	mem (Memory) – The device memory buffer. offset (int) – An offset from the head of the buffer to the place this pointer refers to.
Variables:	device (cupy.cuda.Device) – Device whose memory the pointer refers to. mem (Memory) – The device memory buffer. ptr (int) – Pointer to the place within the buffer.

__add__¶: Adds an offset to the pointer.

__iadd__¶: Adds an offset to the pointer in place.

__int__¶: Returns the pointer value.

__isub__¶: Subtracts an offset from the pointer in place.

__long__¶

__radd__¶: x.__radd__(y) <==> y+x

__rsub__¶: x.__rsub__(y) <==> y-x

__sub__¶: Subtracts an offset from the pointer.

copy_from()¶

Copies a memory sequence from a (possibly different) device or host.

This is a convenience method that selects the appropriate one of copy_from_device() and copy_from_host().

Parameters:	mem (ctypes.c_void_p or cupy.cuda.MemoryPointer) – Source memory pointer. size (int) – Size of the sequence in bytes.

copy_from_async()¶

Copies a memory sequence from an arbitrary place asynchronously.

This is a convenience method that selects the appropriate one of copy_from_device_async() and copy_from_host_async().

Parameters:	mem (ctypes.c_void_p or cupy.cuda.MemoryPointer) – Source memory pointer. size (int) – Size of the sequence in bytes. stream (cupy.cuda.Stream) – CUDA stream.

copy_from_device()¶

Copies a memory sequence from a (possibly different) device.

Parameters:	src (cupy.cuda.MemoryPointer) – Source memory pointer. size (int) – Size of the sequence in bytes.

copy_from_device_async()¶

Copies a memory sequence from a (possibly different) device asynchronously.

Parameters:	src (cupy.cuda.MemoryPointer) – Source memory pointer. size (int) – Size of the sequence in bytes. stream (cupy.cuda.Stream) – CUDA stream.

copy_from_host()¶

Copies a memory sequence from the host memory.

Parameters:	mem (ctypes.c_void_p) – Source memory pointer. size (int) – Size of the sequence in bytes.

copy_from_host_async()¶

Copies a memory sequence from the host memory asynchronously.

Parameters:	mem (ctypes.c_void_p) – Source memory pointer. It must point to pinned memory. size (int) – Size of the sequence in bytes. stream (cupy.cuda.Stream) – CUDA stream.

copy_to_host()¶

Copies a memory sequence to the host memory.

Parameters:	mem (ctypes.c_void_p) – Target memory pointer. size (int) – Size of the sequence in bytes.

copy_to_host_async()¶

Copies a memory sequence to the host memory asynchronously.

Parameters:	mem (ctypes.c_void_p) – Target memory pointer. It must point to pinned memory. size (int) – Size of the sequence in bytes. stream (cupy.cuda.Stream) – CUDA stream.

memset()¶

Fills a memory sequence with a constant byte value.

Parameters:	value (int) – Value to fill. size (int) – Size of the sequence in bytes.

memset_async()¶

Fills a memory sequence with a constant byte value asynchronously.

Parameters:	value (int) – Value to fill. size (int) – Size of the sequence in bytes. stream (cupy.cuda.Stream) – CUDA stream.

cupy.cuda.alloc()¶

Calls the current allocator.

Use set_allocator() to change the current allocator.

Parameters:	size (int) – Size of the memory allocation.
Returns:	Pointer to the allocated buffer.
Return type:	MemoryPointer

cupy.cuda.set_allocator()¶

Sets the current allocator.

Parameters:	allocator (function) – CuPy memory allocator. It must have the same interface as the `cupy.cuda.alloc()` function, which takes the buffer size as an argument and returns the device buffer of that size.
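Since an allocator is just a function from a size to a buffer, allocators can be composed by wrapping one in another. The sketch below shows a wrapper that records every requested size; `make_logging_allocator` and `fake_allocator` are hypothetical names, and the fake allocator returns a host `bytearray` so that the example runs without a GPU.

```python
def make_logging_allocator(base_allocator, log):
    """Wrap an allocator so that each requested size is appended to log."""
    def allocator(size):
        log.append(size)
        return base_allocator(size)
    return allocator


def fake_allocator(size):
    # Stand-in for a real device allocator (assumption for illustration).
    return bytearray(size)


sizes = []
alloc = make_logging_allocator(fake_allocator, sizes)
buf = alloc(16)

# With CuPy, the wrapper would be installed roughly like this (untested
# sketch; it needs a CUDA device):
#     pool = cupy.cuda.MemoryPool()
#     cupy.cuda.set_allocator(make_logging_allocator(pool.malloc, sizes))
```

Note that the base allocator should not be cupy.cuda.alloc itself: alloc() calls the current allocator, so installing a wrapper around it would presumably call back into the wrapper.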

class cupy.cuda.MemoryPool¶

Memory pool for all devices on the machine.

A memory pool preserves any allocations even if they are freed by the user. Freed memory buffers are held by the pool as free blocks and are reused for later allocations of the same size. Allocated blocks are managed per device, so one instance of this class can be used for multiple devices.

Note

When an allocation is satisfied by reusing a pre-allocated block, cudaMalloc is not called and therefore no CPU-GPU synchronization occurs. This makes interleaved memory allocations and kernel invocations very fast.

Note

The memory pool holds on to allocated blocks for as long as possible instead of freeing them. As a result, the program may hold most of the device memory, which can cause other CUDA programs running in parallel to run out of memory.

Parameters:	allocator (function) – The base CuPy memory allocator. It is used for allocating new blocks when the blocks of the required size are all in use.
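The reuse policy described above (free blocks are kept and handed out again for requests of the same size) can be sketched in a few lines of pure Python. `ToyPool` is a hypothetical single-device model with an explicit `free()` method; the real pool manages blocks per device and is not implemented this way.

```python
from collections import defaultdict


class ToyPool:
    """Toy size-bucketed pool; NOT cupy.cuda.MemoryPool."""

    def __init__(self, allocator):
        self._allocator = allocator       # base allocator for new blocks
        self._free = defaultdict(list)    # size -> list of free blocks

    def malloc(self, size):
        free = self._free[size]
        if free:
            return free.pop()             # reuse: base allocator not called
        return self._allocator(size)      # no free block of this size

    def free(self, block, size):
        # Keep the block for later reuse instead of releasing it.
        self._free[size].append(block)

    def n_free_blocks(self):
        return sum(len(blocks) for blocks in self._free.values())

    def free_all_blocks(self):
        self._free.clear()                # actually release the free blocks


calls = []

def counting_allocator(size):
    calls.append(size)
    return bytearray(size)

pool = ToyPool(counting_allocator)
a = pool.malloc(64)      # new block from the base allocator
pool.free(a, 64)         # returned to the pool, not released
b = pool.malloc(64)      # same block reused; base allocator not called
```

The sketch also makes the memory-holding caveat visible: nothing is released until free_all_blocks() is called.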

free_all_blocks()¶: Release free blocks.

free_all_free()¶: Release free blocks.

malloc()¶

Allocates the memory, from the pool if possible.

This method can be used as a CuPy memory allocator. The simplest way to use a memory pool as the default allocator is the following code:

set_allocator(MemoryPool().malloc)

Parameters:	size (int) – Size of the memory buffer to allocate in bytes.
Returns:	Pointer to the allocated buffer.
Return type:	MemoryPointer

n_free_blocks()¶

Count the total number of free blocks.

Returns:	The total number of free blocks.
Return type:	int

Streams and events¶

class cupy.cuda.Stream(null=False, non_blocking=False)[source]¶

CUDA stream.

This class handles the CUDA stream handle in an RAII way, i.e., when a Stream instance is destroyed by the GC, its handle is also destroyed.

Parameters:	null (bool) – If `True`, the stream is a null stream (i.e. the default stream that synchronizes with all streams). Otherwise, a plain new stream is created. non_blocking (bool) – If `True`, the stream does not synchronize with the NULL stream.
Variables:	ptr (cupy.cuda.runtime.Stream) – Raw stream handle. It can be passed to the CUDA Runtime API via ctypes.

add_callback(callback, arg)[source]¶

Adds a callback that is called when all queued work is done.

Parameters:	callback (function) – Callback function. It must take three arguments (Stream object, int error status, and user data object) and return nothing. arg (object) – Argument to the callback.
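A callback with the three-argument shape described above might look like the following sketch. `make_done_callback` is a hypothetical helper; the callback is invoked by hand here so that the example runs without a GPU, and a status of 0 is assumed to correspond to cudaSuccess.

```python
def make_done_callback(results):
    """Build a callback that records (status, user_data) pairs in results."""
    def on_done(stream, status, user_data):
        # stream: the Stream object, status: int error status,
        # user_data: the object passed as arg to add_callback().
        results.append((status, user_data))
        # Returns nothing, as the callback interface requires.
    return on_done


results = []
callback = make_done_callback(results)

# Manual invocation for illustration; with CuPy the registration would be
# roughly stream.add_callback(callback, 'batch-0') on a Stream instance.
callback(None, 0, 'batch-0')
```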

done¶: True if all work on this stream has been done.

record(event=None)[source]¶

Records an event on the stream.

Parameters:	event (None or cupy.cuda.Event) – CUDA event. If `None`, then a new plain event is created and used.
Returns:	The recorded event.
Return type:	cupy.cuda.Event

Profiler¶

cupy.cuda.profile(*args, **kwds)[source]¶

Enable CUDA profiling during a with statement.

This function enables profiling on entering a with statement, and disables profiling on leaving the statement.

>>> with cupy.cuda.profile():
...    # do something you want to measure
...    pass

cupy.cuda.profiler.initialize()¶

Initialize the CUDA profiler.

This function initializes the CUDA profiler. See the CUDA documentation for details.

Parameters:	config_file (str) – Name of the configuration file. output_file (str) – Name of the output file. output_mode (int) – `cupy.cuda.profiler.cudaKeyValuePair` or `cupy.cuda.profiler.cudaCSV`.

cupy.cuda.profiler.start()¶

Enable profiling.

Enables CUDA profiling. An exception is raised if an error occurs.

See the CUDA documentation for details.

cupy.cuda.profiler.stop()¶

Disable profiling.

Disables CUDA profiling. An exception is raised if an error occurs.

See the CUDA documentation for details.

cupy.cuda.nvtx.Mark()¶

Marks an instantaneous event (marker) in the application.

Markers are used to describe events at a specific time during execution of the application.

Parameters:	message (str) – Name of a marker. id_color (int) – ID of color for a marker.

cupy.cuda.nvtx.MarkC()¶

Marks an instantaneous event (marker) in the application.

Markers are used to describe events at a specific time during execution of the application.

Parameters:	message (str) – Name of a marker. color (uint32) – Color code for a marker.

cupy.cuda.nvtx.RangePush()¶

Starts a nested range.

Ranges are used to describe events over a time span during execution of the application. The duration of a range is defined by the matching pair of RangePush*() and RangePop() calls.

Parameters:	message (str) – Name of a range. id_color (int) – ID of color for a range.
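Because every push needs a matching pop, it can be convenient to wrap the pair in a context manager so that the range is closed even if the body raises. The sketch below takes the push and pop functions as arguments, which keeps it runnable without a GPU; `nvtx_range`, `fake_push`, and `fake_pop` are hypothetical names, and passing `cupy.cuda.nvtx.RangePush` and `cupy.cuda.nvtx.RangePop` in their place is an untested assumption.

```python
import contextlib


@contextlib.contextmanager
def nvtx_range(push, pop, message, id_color):
    """Open a range on entry and always close it on exit."""
    push(message, id_color)
    try:
        yield
    finally:
        pop()  # runs even if the body raises, keeping push/pop balanced


# Stand-ins that record calls instead of talking to NVTX.
events = []

def fake_push(message, id_color):
    events.append(('push', message, id_color))

def fake_pop():
    events.append(('pop',))

with nvtx_range(fake_push, fake_pop, 'outer', 0):
    with nvtx_range(fake_push, fake_pop, 'inner', 1):
        pass  # work to be annotated goes here
```

Nesting the with statements produces properly nested ranges, since the inner range is always popped before the outer one.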

cupy.cuda.nvtx.RangePushC()¶

Starts a nested range.

Ranges are used to describe events over a time span during execution of the application. The duration of a range is defined by the matching pair of RangePush*() and RangePop() calls.

Parameters:	message (str) – Name of a range. color (uint32) – ARGB color for a range.

cupy.cuda.nvtx.RangePop()¶

Ends a nested range.

Ranges are used to describe events over a time span during execution of the application. The duration of a range is defined by the matching pair of RangePush*() and RangePop() calls.

RangePop() takes no parameters.