Low-Level CUDA Support¶
Device management¶
class cupy.cuda.Device¶

Object that represents a CUDA device.

This class provides some basic manipulations of CUDA devices.

It supports the context protocol. For example, the following code temporarily switches the current device:

with Device(0):
    do_something_on_device_0()

After the with statement finishes, the current device is restored to the original one.

Parameters: device (int or cupy.cuda.Device) – Index of the device to manipulate. Be careful that the device ID (a.k.a. GPU ID) is zero-based. If it is a Device object, its ID is used. The current device is selected by default.

Variables: id (int) – ID of this device.
__eq__¶ x.__eq__(y) <==> x==y

__ge__¶ x.__ge__(y) <==> x>=y

__gt__¶ x.__gt__(y) <==> x>y

__int__¶

__le__¶ x.__le__(y) <==> x<=y

__long__¶

__lt__¶ x.__lt__(y) <==> x<y

__ne__¶ x.__ne__(y) <==> x!=y

__repr__¶
compute_capability¶ Compute capability of this device.

The capability is represented by a string containing the major and minor indices. For example, compute capability 3.5 is represented by the string '35'.
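The string form can be illustrated with a small sketch (`format_compute_capability` is a hypothetical helper, not part of CuPy):

```python
# Hypothetical sketch of how the capability string is formed from the
# major/minor indices reported by the device (not CuPy's actual code).
def format_compute_capability(major: int, minor: int) -> str:
    # Concatenate the major and minor indices without a separator,
    # e.g. (3, 5) -> "35".
    return f"{major}{minor}"

print(format_compute_capability(3, 5))  # -> 35
```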
cublas_handle¶ The cuBLAS handle for this device.

The same handle is used for the same device even if the Device instance itself is different.

cusolver_handle¶ The cuSOLVER handle for this device.

The same handle is used for the same device even if the Device instance itself is different.
synchronize()¶ Synchronizes the current thread to the device.

use()¶ Makes this device current.

If you want to switch a device temporarily, use the with statement.
Memory management¶
class cupy.cuda.Memory¶

Memory allocation on a CUDA device.

This class provides an RAII interface to CUDA memory allocation.

Parameters:
- device (cupy.cuda.Device) – Device whose memory the pointer refers to.
- size (int) – Size of the memory allocation in bytes.

__int__¶ Returns the pointer value to the head of the allocation.

__long__¶
class cupy.cuda.MemoryPointer¶

Pointer to a location in device memory.

An instance of this class holds a reference to the original memory buffer and a pointer to a place within this buffer.

Variables:
- device (cupy.cuda.Device) – Device whose memory the pointer refers to.
- mem (Memory) – The device memory buffer.
- ptr (int) – Pointer to the place within the buffer.
__add__¶ Adds an offset to the pointer.

__iadd__¶ Adds an offset to the pointer in place.

__int__¶ Returns the pointer value.

__isub__¶ Subtracts an offset from the pointer in place.

__long__¶

__radd__¶ x.__radd__(y) <==> y+x

__rsub__¶ x.__rsub__(y) <==> y-x

__sub__¶ Subtracts an offset from the pointer.
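The arithmetic methods above can be modeled in plain Python: out-of-place forms yield a new pointer into the same buffer (keeping the buffer alive through the `mem` reference), while in-place forms shift `ptr`. This is a hypothetical sketch, not CuPy's implementation:

```python
# Toy model of MemoryPointer offset arithmetic.
class MemoryPointer:
    def __init__(self, mem, ptr):
        self.mem = mem   # underlying buffer, kept alive by this reference
        self.ptr = ptr   # integer address within the buffer

    def __int__(self):
        return self.ptr

    def __add__(self, offset):
        # New pointer into the same buffer.
        return MemoryPointer(self.mem, self.ptr + offset)

    __radd__ = __add__

    def __iadd__(self, offset):
        # Shift this pointer in place.
        self.ptr += offset
        return self

    def __sub__(self, offset):
        return MemoryPointer(self.mem, self.ptr - offset)

buf = object()          # stands in for a Memory allocation
p = MemoryPointer(buf, 0x1000)
q = p + 16              # new pointer, same buffer
assert int(q) == 0x1010 and q.mem is p.mem
p += 8                  # in-place shift
assert int(p) == 0x1008
```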
copy_from()¶ Copies a memory sequence from a (possibly different) device or host.

This function is a convenient interface that selects the appropriate one of copy_from_device() and copy_from_host().

Parameters:
- mem (ctypes.c_void_p or cupy.cuda.MemoryPointer) – Source memory pointer.
- size (int) – Size of the sequence in bytes.

copy_from_async()¶ Copies a memory sequence from an arbitrary place asynchronously.

This function is a convenient interface that selects the appropriate one of copy_from_device_async() and copy_from_host_async().

Parameters:
- mem (ctypes.c_void_p or cupy.cuda.MemoryPointer) – Source memory pointer.
- size (int) – Size of the sequence in bytes.
- stream (cupy.cuda.Stream) – CUDA stream.
copy_from_device()¶ Copies a memory sequence from a (possibly different) device.

Parameters:
- src (cupy.cuda.MemoryPointer) – Source memory pointer.
- size (int) – Size of the sequence in bytes.

copy_from_device_async()¶ Copies a memory sequence from a (possibly different) device asynchronously.

Parameters:
- src (cupy.cuda.MemoryPointer) – Source memory pointer.
- size (int) – Size of the sequence in bytes.
- stream (cupy.cuda.Stream) – CUDA stream.
copy_from_host()¶ Copies a memory sequence from the host memory.

Parameters:
- mem (ctypes.c_void_p) – Source memory pointer.
- size (int) – Size of the sequence in bytes.

copy_from_host_async()¶ Copies a memory sequence from the host memory asynchronously.

Parameters:
- mem (ctypes.c_void_p) – Source memory pointer. It must be pinned memory.
- size (int) – Size of the sequence in bytes.
- stream (cupy.cuda.Stream) – CUDA stream.
copy_to_host()¶ Copies a memory sequence to the host memory.

Parameters:
- mem (ctypes.c_void_p) – Target memory pointer.
- size (int) – Size of the sequence in bytes.

copy_to_host_async()¶ Copies a memory sequence to the host memory asynchronously.

Parameters:
- mem (ctypes.c_void_p) – Target memory pointer. It must be pinned memory.
- size (int) – Size of the sequence in bytes.
- stream (cupy.cuda.Stream) – CUDA stream.
memset()¶ Fills a memory sequence with a constant byte value.

Parameters:
- value (int) – Value to fill.
- size (int) – Size of the sequence in bytes.

memset_async()¶ Fills a memory sequence with a constant byte value asynchronously.

Parameters:
- value (int) – Value to fill.
- size (int) – Size of the sequence in bytes.
- stream (cupy.cuda.Stream) – CUDA stream.
cupy.cuda.alloc()¶ Calls the current allocator.

Use set_allocator() to change the current allocator.

Parameters: size (int) – Size of the memory allocation.
Returns: Pointer to the allocated buffer.
Return type: MemoryPointer

cupy.cuda.set_allocator()¶ Sets the current allocator.

Parameters: allocator (function) – CuPy memory allocator. It must have the same interface as the cupy.cuda.alloc() function, which takes the buffer size as an argument and returns a device buffer of that size.
class cupy.cuda.MemoryPool¶

Memory pool for all devices on the machine.

A memory pool preserves allocations even after the user frees them. Freed memory buffers are held by the memory pool as free blocks, and they are reused for further memory allocations of the same sizes. The allocated blocks are managed per device, so one instance of this class can be used for multiple devices.

Note: When an allocation is served by reusing a pre-allocated block, cudaMalloc is not called, so no CPU-GPU synchronization occurs. This makes interleaved memory allocations and kernel invocations very fast.

Note: The memory pool holds allocated blocks without freeing them as much as possible. This makes the program hold most of the device memory, which may push other CUDA programs running in parallel into out-of-memory situations.

Parameters: allocator (function) – The base CuPy memory allocator. It is used for allocating new blocks when all blocks of the required size are in use.

free_all_blocks()¶ Releases free blocks.

free_all_free()¶ Releases free blocks.

malloc()¶ Allocates memory, from the pool if possible.

This method can be used as a CuPy memory allocator. The simplest way to use a memory pool as the default allocator is:

set_allocator(MemoryPool().malloc)

Parameters: size (int) – Size of the memory buffer to allocate in bytes.
Returns: Pointer to the allocated buffer.
Return type: MemoryPointer
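The reuse behavior described above can be sketched with a toy size-keyed free list: freed blocks are cached per size and handed back to later requests of the same size, so the base allocator is only hit on a pool miss. `ToyPool` is a hypothetical sketch, not CuPy's MemoryPool:

```python
from collections import defaultdict

# Toy size-keyed free-list pool illustrating block reuse.
class ToyPool:
    def __init__(self, base_allocator):
        self._base = base_allocator          # called only on a pool miss
        self._free = defaultdict(list)       # size -> list of free blocks

    def malloc(self, size):
        if self._free[size]:
            return self._free[size].pop()    # reuse: no base allocation
        return self._base(size)

    def free(self, block, size):
        self._free[size].append(block)       # keep the block for reuse

    def free_all_blocks(self):
        self._free.clear()                   # release the cached blocks

calls = []
pool = ToyPool(lambda size: calls.append(size) or object())
a = pool.malloc(512)
pool.free(a, 512)
b = pool.malloc(512)     # served from the free list
assert a is b and calls == [512]  # base allocator was called only once
```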
Streams and events¶

class cupy.cuda.Stream(null=False, non_blocking=False)[source]¶

CUDA stream.

This class handles the CUDA stream handle in an RAII way, i.e., when a Stream instance is destroyed by the GC, its handle is also destroyed.

Variables: ptr (cupy.cuda.runtime.Stream) – Raw stream handle. It can be passed to the CUDA Runtime API via ctypes.

add_callback(callback, arg)[source]¶ Adds a callback that is called when all queued work is done.

Parameters:
- callback (function) – Callback function.
- arg (object) – Argument to the callback.

done¶ True if all work on this stream has been done.

record(event=None)[source]¶ Records an event on the stream.

Parameters: event (None or cupy.cuda.Event) – CUDA event. If None, a new plain event is created and used.
Returns: The recorded event.
Return type: cupy.cuda.Event

wait_event(event)[source]¶ Makes the stream wait for an event.

Future work on this stream will be done after the event.

Parameters: event (cupy.cuda.Event) – CUDA event.
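The ordering guarantee of add_callback() can be modeled by treating a stream as a FIFO of work items: a callback is queued behind all previously submitted work and only fires once that work has run. `ToyStream` is a hypothetical sketch of this ordering, not CuPy's Stream (a real stream executes asynchronously on the device):

```python
from collections import deque

# Toy model of a stream as a FIFO of work items.
class ToyStream:
    def __init__(self):
        self._queue = deque()

    def launch(self, func):
        self._queue.append(func)             # queue a kernel-like work item

    def add_callback(self, callback, arg):
        # The callback is queued behind all existing work.
        self._queue.append(lambda: callback(self, arg))

    def synchronize(self):
        # Drain the queue in FIFO order.
        while self._queue:
            self._queue.popleft()()

log = []
s = ToyStream()
s.launch(lambda: log.append("work"))
s.add_callback(lambda stream, arg: log.append(arg), "done")
s.synchronize()
assert log == ["work", "done"]   # callback ran after the queued work
```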
class cupy.cuda.Event(block=False, disable_timing=False, interprocess=False)[source]¶

CUDA event, a synchronization point of CUDA streams.

This class handles the CUDA event handle in an RAII way, i.e., when an Event instance is destroyed by the GC, its handle is also destroyed.

Parameters:
- block (bool) – If True, the event blocks on the synchronize() method.
- disable_timing (bool) – If True, the event does not prepare the timing data.
- interprocess (bool) – If True, the event can be passed to other processes.

Variables: ptr (cupy.cuda.runtime.Stream) – Raw stream handle. It can be passed to the CUDA Runtime API via ctypes.

done¶ True if the event is done.

record(stream=None)[source]¶ Records the event to a stream.

Parameters: stream (cupy.cuda.Stream) – CUDA stream to record the event. The null stream is used by default.
Profiler¶
cupy.cuda.profile(*args, **kwds)[source]¶ Enable CUDA profiling during a with statement.

This function enables profiling on entering a with statement, and disables profiling on leaving the statement.

>>> with cupy.cuda.profile():
...     # do something you want to measure
...     pass
cupy.cuda.profiler.initialize()¶ Initialize the CUDA profiler.

This function initializes the CUDA profiler. See the CUDA documentation for details.

Parameters:
cupy.cuda.profiler.start()¶ Enable profiling.

A user can enable CUDA profiling. When an error occurs, it raises an exception. See the CUDA documentation for details.

cupy.cuda.profiler.stop()¶ Disable profiling.

A user can disable CUDA profiling. When an error occurs, it raises an exception. See the CUDA documentation for details.
cupy.cuda.nvtx.Mark()¶ Marks an instantaneous event (marker) in the application.

Markers are used to describe events at a specific time during execution of the application.

Parameters: message (str) – Name of a marker.
cupy.cuda.nvtx.MarkC()¶ Marks an instantaneous event (marker) in the application.

Markers are used to describe events at a specific time during execution of the application.

Parameters:
- message (str) – Name of a marker.
- color (uint32) – Color code for a marker.
cupy.cuda.nvtx.RangePush()¶ Starts a nested range.

Ranges are used to describe events over a time span during execution of the application. The duration of a range is defined by the corresponding pair of RangePush*() and RangePop() calls.

Parameters: message (str) – Name of a range.
cupy.cuda.nvtx.RangePushC()¶ Starts a nested range.

Ranges are used to describe events over a time span during execution of the application. The duration of a range is defined by the corresponding pair of RangePush*() and RangePop() calls.

Parameters:
- message (str) – Name of a range.
- color (uint32) – ARGB color for a range.
cupy.cuda.nvtx.RangePop()¶ Ends a nested range.

Ranges are used to describe events over a time span during execution of the application. The duration of a range is defined by the corresponding pair of RangePush*() and RangePop() calls.
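The push/pop pairing can be illustrated with a toy nested-range tracker: each push opens a range, each pop closes the most recently opened one, and the closed (name, depth) pairs are what a profiler timeline would display. `ToyRanges` is a hypothetical sketch, not the NVTX API:

```python
# Toy nested-range tracker mirroring RangePush()/RangePop() pairing.
class ToyRanges:
    def __init__(self):
        self._stack = []     # currently open ranges, innermost last
        self.closed = []     # (name, depth) pairs in close order

    def range_push(self, message):
        self._stack.append(message)
        return len(self._stack) - 1          # zero-based nesting depth

    def range_pop(self):
        name = self._stack.pop()             # closes the innermost range
        self.closed.append((name, len(self._stack)))

r = ToyRanges()
assert r.range_push("outer") == 0
assert r.range_push("inner") == 1
r.range_pop()                                # closes "inner"
r.range_pop()                                # closes "outer"
assert r.closed == [("inner", 1), ("outer", 0)]
```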