Hello to OpenCL

Quaternion
4 min read · Jan 19, 2021


Photo by Muhammad Toqeer on Unsplash

OpenCL (Open Computing Language) is a framework, maintained by the Khronos Group, with a standard API and programming language for parallel computing across heterogeneous devices. It includes a language for writing kernels (functions that execute on OpenCL devices) and APIs used to define and then control the platforms.

OpenCL has been adopted by Apple, Intel, AMD, Nvidia, ARM, and others. The latest version, OpenCL 3.0, was released in 2020.

Heterogeneous Computing

OpenCL Computing Devices

Now, there is a question: what are heterogeneous devices (or, as we say, heterogeneous computing)? Literally, it means using more than one kind of processor or core in a system to tackle heavy computing tasks across platforms. This lets the system dynamically balance the data load across the available processors. On what specific platforms can we apply this technology? Here are several examples.

  • Multi-Core CPU
  • Many-Core GPGPU
  • DSP, NPU, FPGA, …

A multi-core CPU is a MIMD machine composed of clusters of cores. In OpenCL it handles serial or task-parallel workloads.

A many-core GPGPU is a SIMT machine, which suits OpenCL's data-parallel workloads.

Open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors

Specific accelerators such as DSPs, NPUs, and FPGAs offer VLIW, data reuse, reconfiguration, or other domain-specific accelerations.

To sum up, OpenCL provides program portability: it is a framework for building parallel applications that are portable across heterogeneous platforms.

OpenCL uses four theoretical models to organize and describe the whole architecture, which we introduce next.

Platform Model

OpenCL provides a unified programming interface which has the abstracted platform model and runtime kernel source compilation.

OpenCL Platform Model, Source: OpenCL Programming Guide

The model consists of a host connected to one or more OpenCL devices. Each device is a collection of one or more compute units, and each compute unit is composed of processing elements, which execute code in SIMD or SPMD fashion.

OpenCL Runtime Compilation

Moreover, OpenCL applications rely on runtime compilation to achieve portability. A compiler framework such as LLVM translates the application's OpenCL source code into an intermediate representation, which is then compiled into an executable binary for the target compute device.

Execution Model

OpenCL Execution Model, Source: OpenCL Basics
An abstracted hierarchical system, covering both the compute hierarchy and the memory hierarchy

When an OpenCL kernel is submitted for execution by the host, an index space is defined. A work-item executes for each point in the index space, and work-items are organized into work-groups. This N-dimensional index space is called the NDRange, where N is 1, 2, or 3. The workload hierarchy uses the NDRange to describe the global and local sizes for the work-items.

An OpenCL context is created for the workloads (compute devices) on a specific platform (e.g., an AMD GPU or CPU). The context can be thought of as an abstract container that hides the low-level details of the different devices and provides a consistent interface for the OpenCL program to interact with them.

In OpenCL’s abstracted hierarchical system, the NDRange workload maps to a compute device, a work-group maps to a compute unit (the synchronization unit), and a work-item maps to a processing element.

Memory Model

OpenCL Memory Model, Source: OpenCL — Runtime System

As the figure above shows, the memory model is mainly composed of two parts. The context part is where things become a bit more complex and differ from Nvidia’s CUDA: here we can use more than one device, rather than a single one (CUDA runs only on Nvidia’s GPUs), to carry out the computation.

Memory objects are used by defining the context as an abstraction over a collection of devices; within it we get the notion of work-groups executing on the processing elements. For example, one set of four work-groups can execute together on one device while another set of four runs on a second device.

Programming Model

The OpenCL programming model supports data parallel and task parallel programming models. It also describes the task synchronization primitives.

  • Data parallel (SIMD) is the simultaneous execution on multiple cores (processing elements) of the same function across the elements of a dataset.
  • Task parallel (MIMD) is the simultaneous execution on multiple cores of many different functions across the same or different datasets.
  • The two models can also be combined (hybrid parallelism).

Reference

[1] OpenCL Basics, https://sites.google.com/site/csc8820/opencl-basics

[2] OpenCL — Runtime System, https://www.youtube.com/watch?v=NmUAJwPOJ7o&t=217s

[3] An OpenCL Runtime System for a Heterogeneous Many-Core Virtual Platform, https://caslab.ee.ncku.edu.tw/dokuwiki/_media/research:caslab_2014_cnf_03.pdf
