Wednesday, September 30, 2015

Introduction to i.MX6Q/D (GC2000) Vivante OpenCL Embedded Profile

The purpose of this post is not to make you a new OpenCL expert, but to provide you with the basic knowledge to take advantage of the i.MX6’s GPGPU support and get your code (or part of it) accelerated by its Graphics Processing Unit.


First of all, what are GPGPU and OpenCL?



GPGPU:


       Stands for General-Purpose computing on Graphics Processing Units

       Algorithms well-suited to GPGPU implementation exhibit two properties: they are data parallel and throughput intensive.

       Data parallel: means that the same operation can be executed on many data elements simultaneously.

       Throughput intensive: means that the algorithm is going to process lots of data elements, so there will be plenty to operate on in parallel.

       Pixel-based applications such as computer vision and video/image processing are very well suited to GPGPU technology, and for this reason many of the commercial software packages in these areas now include GPGPU acceleration.


OpenCL


       Open Computing Language (OpenCL) is a framework for writing programs that execute across heterogeneous platforms consisting of central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other processors.

       OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus application programming interfaces (APIs) that are used to define and then control the platforms.

       OpenCL provides parallel computing using task-based and data-based parallelism.

       OpenCL is an open standard maintained by the non-profit technology consortium Khronos Group.

       Apple, Intel, Qualcomm, Advanced Micro Devices (AMD), Nvidia, Altera, Samsung, Vivante and ARM Holdings have adopted it.

  
There are A LOT of OpenCL tutorials on the web explaining all its concepts and capabilities. Below you will find only the most important concepts:


Introduction to OpenCL


       In order to visualize the heterogeneous architecture in terms of the API and to restrict memory usage for parallel execution, OpenCL defines multiple cascading layers of virtual hardware definitions.

       The basic execution engine that runs the kernels is called a Processing Element (PE)

       A group of Processing Elements is called a Compute Unit (CU)

       Finally, a group of Compute Units is called a Compute Device.


       A host system could interact with multiple Compute Devices on a system (e.g., a GPGPU and a DSP), but data sharing and synchronization is coarsely defined at this level. 




       Each item a kernel works on is called a 'work-item'.

       A simple example of this is determining the color of a single pixel (work-item) in an output image.

       Work-items are grouped into 'work-groups', which are each executed in parallel to speed up calculation performance.

       How big a work-group is depends on the algorithm being executed and the dimensions of the data being processed (e.g., one work-item per pixel for a block of pixels in a filter).





OpenCL uses a 'data parallel' programming model in which a kernel runs once for each item in an 'index space'. The index space matches the dimensionality of the data being processed (1, 2, or 3 dimensions) and is called the NDRange (N-dimensional range).



Freescale’s i.MX6Q/D GPU (GC2000) OpenCL EP features


       Vivante GC2000 GPGPU capable of running OpenCL 1.1 Embedded Profile (EP)

       OpenCL Embedded Profile restrictions apply (for instance, no atomic functions, and no mandated support for 3D images, 64-bit integers, or double-precision floating point numbers)

       4 vec4 SIMD shader cores

       Up to 512 general-purpose registers (128-bit each) per core

       Maximum of 512 instructions per kernel

       1-cycle throughput for all shader instructions

       4 KB L1 cache

       Uniform registers: 168 for the vertex shader and 64 for the fragment shader

       Single integer pipeline per core

       In OpenCL Embedded Profile the requirements for samplers are reduced, with the number of samplers decreased from 16 (FP, Full Profile) to 8 (EP), and the math precision (ULP) is slightly relaxed below the IEEE-754 specification for some functions

       Lastly, in OpenCL EP the minimum required image dimension is reduced to 2048 (from 8192 in FP) and the local memory requirement to 1 KB (from 32 KB in FP)





Each of the shader cores functions as a CU. The cores implement a native vec4 ISA, so the preferred vector width for all primitives is 4.





Code Optimization for Freescale’s i.MX6Q/D OpenCL EP


       Vector math inputs in multiples of 4.

     As mentioned previously, the GC2000 in i.MX 6Q is a vec4 floating point SIMD engine, so vector math always prefers 4 inputs (or a multiple of 4) for maximum math throughput.


       Use full 32 bit native registers for math.

      Both integer and floating point math is natively 32-bit. 8- and 16-bit primitives will still use 32-bit registers, so there is no gain (for the math computation) in going with smaller sizes.

       Use floating point instead of integer formats

     1x 32-bit Integer pipeline (supports 32-bit INT formats in hardware, 8-bit/16-bit done in software)
     4x 32-bit Floating Point pipeline (supports 16-bit and 32-bit FP formats in hardware)


       To maximize OpenCL compute efficiency, it is better to convert integer formats to floating point to utilize the four (4) parallel FP math units.


       Use 16-bit integer division and 32-bit for other integer math operations

      For integer math (excluding division), there is one 32-bit integer adder and one 32-bit integer multiplier per core. If integer formats are required, use 32-bit integer formats for addition, multiplication, mask, and sign extensions.
      Try to minimize or avoid 8-bit or 16-bit integer formats, since they will be calculated in software and the 32-bit INT ALU will not be used.
      Integer division: i.MX 6Q hardware supports only 16-bit integer division; software is used to implement 8-bit and 32-bit division.
      It is better to use 16-bit division if possible. There will be a performance penalty if 32-bit division is used.


       Use Round to Zero mode

     Floating point computation supports “round-to-zero” only (round-to-nearest-even is not required for EP, if round-to-zero is supported).


       Data accesses should be 16 bytes (16B)

     For the most efficient use of the GPGPU’s L1 cache.
     Work-group size should be a multiple of thread-group size.
     Work-group size should be an integer multiple of the GPU's internal preferred work-group size (16 for GC2000) for optimum hardware usage.


       Keep Global work size at 64K (maximum) per dimension

      Since global IDs for work-items are 16 bits, it is necessary to keep the global work size within 64K (65,536 = 2^16) per dimension.


       Conditional Branching should be avoided if possible

      Branch penalties depend on the percentage of work-items that go down the branch paths.



This post is already long enough for “introductory” information about i.MX6Q/D OpenCL EP; for more information, including a sample application, take a look at this good white paper provided by Freescale: https://community.freescale.com/docs/DOC-100694


EOF !




8 comments:

  1. Any idea, why the OpenCL local memory is not mapped to HW's local memory (even if it is just 1k anyway)?

    1. From Vivante's OpenCL EP documentation:

      -----------------------------------
      Using local memory typically is an order of magnitude faster than accessing host memory through global memory (RAM). However, execution cores do not directly access local memory; instead, they issue memory requests through dedicated hardware units. When a work-item tries to access local memory, the work-item is transferred to the appropriate fetch unit. The work-item then is deactivated until the access unit finishes accessing local memory.
      Select Vivante cores include local storage registers in the hardware. Local storage registers are 16 bytes each and are shared across all work-items within a work-group.
      The total number of local storage registers used by a work-group determines the total number of work-groups that can run concurrently on the GPGPU. Having more work-groups allocated concurrently in the GPGPU generally provides better throughput.
      -----------------------------------

      regards,
      Andre

  2. Thanks for your answer, Andre! I much appreciate it!

    However, do I understand correctly that an OpenCL program running on the GC2000 would benefit from using local memory? I recall from some site that it was not advisable to use local memory, as it basically used the system memory in the background.

    Where can I find this Vivante documentation? I tried to request documentation from Vivante at some point, but received nothing in answer. I feel I definitely should've seen that documentation before I made an image signal processing pipeline with OpenCL on the GC2000 :)

    And I should've read your blog to know that image2d was not going to make things fly... :)

  3. Sorry, but I can't share the documentation, as it is Vivante confidential, but you can download the Graphics user guide from the NXP website; it has a lot of information.

    Regards,
    Andre

  4. I can't seem to find the document from NXP (I do have an account there) by searching "Graphics User Guide i.MX6". I've found the "Graphics Development on the i.MX 6 Series" doc with Vivante optimization tips for OpenGL, OpenVG and OpenCL. However, it is a bit short on hard details, though helpful.

    Could you give me the document number that I can use to find it at NXP, please?

  5. Ok, I've found the document: IMX6GRAPHICUG. I have to study the doc.

    Thanks for your help, Andre!
    1. Awesome !! I had the document in my hands but I am sad I couldn't share it. I am glad you found a version that was publicly available !

      Cheers !