cudaErrorInvalidMemcpyDirection (21)

cudaErrorInvalidMemcpyDirection (error code 21) occurs when you specify an invalid or incompatible direction parameter in a CUDA memory copy operation. This happens when the cudaMemcpyKind parameter doesn't match the actual memory locations being copied. This error typically appears when using the wrong cudaMemcpyKind enum value (cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, etc.) or when trying to use a direction that isn't supported by the specific cudaMemcpy variant being called. This guide explains the valid memory copy directions, common mistakes, and best practices for reliable memory transfers in CUDA.
CUDA error: invalid memcpy direction
cudaErrorInvalidMemcpyDirection: invalid memcpy direction
RuntimeError: CUDA error: invalid memcpy direction
Error: invalid cudaMemcpyKind
Ensure you are using the appropriate direction constant for your copy operation.
// Valid cudaMemcpyKind values
cudaMemcpyHostToDevice // CPU -> GPU: source is host memory, destination is device memory
cudaMemcpyDeviceToHost // GPU -> CPU: source is device memory, destination is host memory
cudaMemcpyDeviceToDevice // GPU -> GPU (same device)
cudaMemcpyHostToHost // CPU -> CPU (rarely used; plain memcpy is usually better)
cudaMemcpyDefault // Auto-detect from pointer values (requires unified virtual addressing; recommended mainly with managed memory)
// Example: Copy from host to device
float* h_data = new float[1024]; // pageable host buffer (contents uninitialized here)
float* d_data;
cudaMalloc(&d_data, 1024 * sizeof(float)); // device allocation; check the returned cudaError_t in real code
cudaMemcpy(d_data, h_data, 1024 * sizeof(float), cudaMemcpyHostToDevice); // destination first, then source; kind matches dst=device, src=host
// Example: Copy from device to host
cudaMemcpy(h_data, d_data, 1024 * sizeof(float), cudaMemcpyDeviceToHost);

When using unified memory, cudaMemcpyDefault auto-detects the direction.
// Unified memory allocation
float* unified_data;
cudaMallocManaged(&unified_data, 1024 * sizeof(float)); // managed buffer, accessible from both host and device
// cudaMemcpyDefault works with unified memory
float* h_buffer = new float[1024];
cudaMemcpy(h_buffer, unified_data, 1024 * sizeof(float), cudaMemcpyDefault); // runtime infers the direction from the pointers
// Or just access directly (no explicit copy needed)
for (int i = 0; i < 1024; i++) {
h_buffer[i] = unified_data[i]; // Automatic migration: pages fault over to the host on first touch
}
cudaFree(unified_data);

For CPU-to-CPU copies, use standard C memcpy instead of the CUDA API.
#include <cstring> // for memcpy
float* h_src = new float[1024];
float* h_dst = new float[1024];
// Wrong: Using CUDA API for host-to-host
// cudaMemcpy(h_dst, h_src, 1024 * sizeof(float), cudaMemcpyHostToHost);
// Correct: Use standard memcpy (no CUDA runtime overhead for pure CPU copies)
memcpy(h_dst, h_src, 1024 * sizeof(float));
// Or use std::copy for C++
#include <algorithm> // for std::copy (place includes at the top of the file in real code)
std::copy(h_src, h_src + 1024, h_dst);

For GPU-to-GPU copies on the same device, use proper synchronization.
float* d_src;
float* d_dst;
cudaMalloc(&d_src, 1024 * sizeof(float)); // error checks omitted for brevity
cudaMalloc(&d_dst, 1024 * sizeof(float));
// Synchronous device-to-device copy
cudaMemcpy(d_dst, d_src, 1024 * sizeof(float), cudaMemcpyDeviceToDevice);
// Or asynchronous with stream
cudaStream_t stream;
cudaStreamCreate(&stream);
// NOTE: release the stream with cudaStreamDestroy(stream) once all work queued on it has completed
cudaMemcpyAsync(d_dst, d_src, 1024 * sizeof(float),
cudaMemcpyDeviceToDevice, stream);
cudaStreamSynchronize(stream);

Double-check that your source and destination pointers match the direction.
// Helper: print whether a pointer refers to host, device, or managed memory.
// Useful before a cudaMemcpy to verify that the cudaMemcpyKind matches the pointers.
void checkPointerType(void* ptr) {
    cudaPointerAttributes attr;
    // The return code must be checked: pre-CUDA-11 runtimes return
    // cudaErrorInvalidValue for plain (unregistered) host pointers, leaving
    // attr unpopulated and a sticky error in the runtime.
    cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
    if (err != cudaSuccess) {
        cudaGetLastError(); // clear the sticky error so later CUDA calls are unaffected
        printf("Unknown memory type (%s)\n", cudaGetErrorString(err));
        return;
    }
    switch (attr.type) {
        case cudaMemoryTypeHost:
            printf("Host memory\n");
            break;
        case cudaMemoryTypeDevice:
            printf("Device memory\n");
            break;
        case cudaMemoryTypeManaged:
            printf("Managed/Unified memory\n");
            break;
        case cudaMemoryTypeUnregistered:
            // CUDA 11+: ordinary host allocations report this instead of failing.
            printf("Unregistered host memory\n");
            break;
        default:
            printf("Unknown memory type\n");
    }
}
// Use before memcpy to verify the direction flag
checkPointerType(d_data); // Should be Device
checkPointerType(h_data); // Should be Host

For copying between different GPUs, enable peer access first.
int canAccessPeer;
// Query whether device 0 can directly access device 1's memory
cudaDeviceCanAccessPeer(&canAccessPeer, 0, 1);
if (canAccessPeer) {
    // Enable direct P2P transfers from device 0 to device 1
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0); // second argument is flags and must be 0
    // Allocate one buffer on each GPU
    float *d_data0, *d_data1;
    cudaSetDevice(0);
    cudaMalloc(&d_data0, 1024 * sizeof(float));
    cudaSetDevice(1);
    cudaMalloc(&d_data1, 1024 * sizeof(float));
    // Copy from GPU 0 to GPU 1: (dst, dstDevice, src, srcDevice, bytes)
    cudaMemcpyPeer(d_data1, 1, d_data0, 0, 1024 * sizeof(float));
    // Clean up: free each buffer on the device that owns it
    cudaSetDevice(1);
    cudaFree(d_data1);
    cudaSetDevice(0);
    cudaFree(d_data0);
} else {
    // No direct P2P path. cudaMemcpyPeer still works (the runtime stages the
    // transfer through host memory), or copy device -> host -> device explicitly.
}
The direction flags are backwards. First copy should be HostToDevice, second should be DeviceToHost.
// INTENTIONALLY WRONG: demonstrates the mistake that raises cudaErrorInvalidMemcpyDirection
float* h_data = new float[1024];
float* d_data;
cudaMalloc(&d_data, 1024 * sizeof(float));
// Wrong direction - reversed! dst is device and src is host, so this must be HostToDevice
cudaMemcpy(d_data, h_data, 1024 * sizeof(float), cudaMemcpyDeviceToHost);
// Process...
// Wrong again - should be DeviceToHost
cudaMemcpy(h_data, d_data, 1024 * sizeof(float), cudaMemcpyHostToDevice);

Direction flags correctly match the source and destination pointers: HostToDevice for upload, DeviceToHost for download.
float* h_data = new float[1024];
float* d_data;
cudaMalloc(&d_data, 1024 * sizeof(float));
// Correct: Host to Device (dst is device, src is host)
cudaMemcpy(d_data, h_data, 1024 * sizeof(float), cudaMemcpyHostToDevice);
// Process on GPU...
myKernel<<<blocks, threads>>>(d_data);
cudaDeviceSynchronize(); // wait for the kernel; also surfaces any in-kernel errors
// Correct: Device to Host (dst is host, src is device)
cudaMemcpy(h_data, d_data, 1024 * sizeof(float), cudaMemcpyDeviceToHost);
cudaFree(d_data);
delete[] h_data;

cudaMemcpyDefault auto-detects the memory location of the source and destination pointers. It requires unified virtual addressing and works most reliably with unified memory (cudaMallocManaged). For explicit allocations, always use the specific direction flags.
Not directly. For multi-GPU copies, use cudaMemcpyPeer after enabling peer access with cudaDeviceEnablePeerAccess. Alternatively, copy to host first, then to the second GPU.
The cudaMemcpyKind parameter is an enum, so incorrect values compile fine but fail at runtime. Use the predefined constants (cudaMemcpyHostToDevice, etc.) rather than integer values.
Rarely. It exists for API completeness but standard memcpy is more efficient for host-to-host copies. The CUDA runtime has additional overhead that provides no benefit for CPU memory.
Invalid parameters in API calls
Memory allocation before copy operations
Invalid pointer in memory operations
Need help debugging CUDA errors? Download RightNow AI for intelligent error analysis and optimization suggestions.