Optimize Your Intensive Computations With Vector API

In this post, we discuss how you can efficiently execute your computationally intensive jobs using the Vector API in Java 24. 

We start with a small introduction to traditional sequential computations compared to modern vector computations. 

What Are Traditional Scalar/Sequential Operations? 

Sequential operations, as the name suggests, process data one element at a time. Usually they run in a loop, taking the next element from an array or queue and processing it until no elements remain.

What Are Modern Vector Operations? 

Vector operations load a fixed-size group of elements (called lanes) from an array into a vector register and process all of them with a single CPU instruction.

Because each instruction handles several elements at once, a vector loop needs far fewer iterations than the equivalent scalar loop, which often makes it considerably faster.

For large datasets and intensive computations, this data parallelism can lead to significant performance gains. Note that it happens inside a single core. It is separate from multi-threaded execution across cores or processors, and the two can be combined.

We will be demonstrating it shortly. 

What Does Java 24 Offer For Performing Vector Computations?

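Java 24 includes the Vector API, which expresses vector computations that reliably compile at runtime to optimal vector instructions on supported CPU architectures, and can thereby achieve performance superior to equivalent scalar computations. The API was first introduced as an incubator module in Java 16 and is in its ninth incubation in Java 24 (JEP 489). After this many rounds it is fairly stable, but it is still an incubating API: it lives in the jdk.incubator.vector module, must be enabled with --add-modules jdk.incubator.vector, and may still change before it is finalized.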

The JVM's JIT compiler has a built-in auto-vectorizer that tries to convert ordinary scalar loops into vector instructions if the underlying CPU has SIMD (Single Instruction, Multiple Data) or predicate registers. But more importantly, programmers can now develop custom vector loops that express high-performance algorithms, such as a vectorized hashCode or specialized array comparisons, which an auto-vectorizer may never optimize. Numerous domains can benefit from this explicit Vector API in Java 24, including scientific simulations, machine learning, linear algebra, finance, cryptography and code within the JDK itself.
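How wide the vectors are depends on the CPU. As a small sketch, the program below asks the API which species (element type plus vector width) it prefers on the current machine and compares it with a fixed 256-bit species. The class name SpeciesInfo and the helper describe are ours, chosen for illustration. It needs the incubator module enabled with --add-modules jdk.incubator.vector.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorSpecies;

// Compile and run with: --add-modules jdk.incubator.vector
class SpeciesInfo {

    // Describes a species: element type, lane count, and vector width in bits.
    static String describe(VectorSpecies<?> species) {
        return species.elementType().getSimpleName()
                + " x " + species.length()
                + " lanes = " + species.vectorBitSize() + " bits";
    }

    public static void main(String[] args) {
        // SPECIES_PREFERRED is the shape the JVM considers best for this CPU,
        // so these two lines print different values on different machines.
        System.out.println("Preferred double species: " + describe(DoubleVector.SPECIES_PREFERRED));
        System.out.println("Preferred float species:  " + describe(FloatVector.SPECIES_PREFERRED));

        // A fixed shape always has the same lane count, whatever the hardware.
        System.out.println("Fixed 256-bit species:    " + describe(DoubleVector.SPECIES_256));
    }
}
```

Using SPECIES_PREFERRED instead of a hard-coded width lets the same code adapt to whatever vector registers the machine offers.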

What Are Some Prerequisites for Executing Vector Operations? 

Some prerequisites you should keep in mind are:

  • The data to be operated upon should be stored in contiguous memory, such as a primitive array, so that the processor can load it into vector registers efficiently. The Vector API loads slices of such arrays into Vector<E> objects.
  • Vector processors have a limited instruction set that’s optimized for numerical computations, so choose your operations wisely.
  • The algorithm chosen must be such that it benefits from lanewise vector operations, and you must rewrite your scalar sequential loop in terms of vector operations. For simple element-by-element loops, this is usually straightforward.
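To illustrate the last point, here is a minimal sketch of the same element-by-element computation (out[i] = a * x[i] + y[i]) written first as a scalar loop and then as a vector loop. The class ScaleAndAdd and its method names are hypothetical, chosen only for this example. The vector version processes whole vectors up to loopBound and finishes the leftover elements with a plain scalar loop.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

// Compile and run with: --add-modules jdk.incubator.vector
class ScaleAndAdd {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

    // Scalar version: one element per iteration.
    static void scalar(double a, double[] x, double[] y, double[] out) {
        for (int i = 0; i < x.length; i++) {
            out[i] = a * x[i] + y[i];
        }
    }

    // Vector version: SPECIES.length() elements per iteration, then a scalar tail.
    static void vector(double a, double[] x, double[] y, double[] out) {
        int i = 0;
        // Largest multiple of the lane count that is <= x.length.
        int upper = SPECIES.loopBound(x.length);
        for (; i < upper; i += SPECIES.length()) {
            DoubleVector xv = DoubleVector.fromArray(SPECIES, x, i);
            DoubleVector yv = DoubleVector.fromArray(SPECIES, y, i);
            xv.mul(a).add(yv).intoArray(out, i);
        }
        // Leftover elements that do not fill a whole vector.
        for (; i < x.length; i++) {
            out[i] = a * x[i] + y[i];
        }
    }

    public static void main(String[] args) {
        double[] x = new double[10];
        double[] y = new double[10];
        for (int i = 0; i < x.length; i++) {
            x[i] = i;
            y[i] = 0.5 * i;
        }
        double[] scalarOut = new double[x.length];
        double[] vectorOut = new double[x.length];
        scalar(2.0, x, y, scalarOut);
        vector(2.0, x, y, vectorOut);
        System.out.println("Same results: " + Arrays.equals(scalarOut, vectorOut));
        System.out.println(Arrays.toString(vectorOut));
    }
}
```

A scalar tail is one way to handle array lengths that are not a multiple of the lane count. A VectorMask is the other, and is the approach taken in the trigonometric program later in this post.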

Let's write a program that uses sequential operations, and then repeat the same computation with vector operations. We start with some trigonometric operations using traditional sequential processing. We choose trigonometric functions because they are CPU intensive and require a lot of computation to produce each result.

public class ScalarTrigonometricComputations {
    public static void main(String[] args) {
        double[] angles = new double[20];
        double[] sinResults = new double[angles.length];
        double[] cosResults = new double[angles.length];
        double[] atanResults = new double[angles.length];

        for (int i = 0; i < angles.length; i++) {
            angles[i] = i * Math.PI / 20; // gradually increasing angle
        }

        long start = System.nanoTime();
        for (int i = 0; i < angles.length; i++) {
            sinResults[i] = Math.sin(angles[i]);
            cosResults[i] = Math.cos(angles[i]);
            atanResults[i] = Math.atan(angles[i]);
        }
        long end = System.nanoTime();
        System.out.println("Scalar Time (ms): " + (end - start) / 1_000_000.0);

        // Print the results
        System.out.printf("%-10s %-10s %-10s %-10s%n", "Angle", "Sin", "Cos", "Atan");
        for (int i = 0; i < angles.length; i++) {
            System.out.printf("%-10.5f %-10.5f %-10.5f %-10.5f%n",
                    angles[i], sinResults[i], cosResults[i], atanResults[i]);
        }
    }
}

When run on a modern CPU, this program took 90-120 milliseconds to execute for me.
The table below is an indicative comparison of the time taken by scalar computations vs. vector computations. Your numbers will vary with hardware and JVM, but the relative improvement should be of a similar order.

Operation Type      Approx. Time (ms)
Scalar (Math.*)     ~90–120
Vector API          ~40–60

Fig: Scalar vs Vector Computations: Benchmark Results

Let’s now repeat the same with Vector computations.

A vector is represented by the abstract class Vector<E>, where the type <E> is the boxed form of a primitive element type: Byte, Short, Integer, Long, Float or Double.

A vector can be visualized as a sequence of a fixed number of lanes, with each lane having an element of the same data type. Operations on vectors are typically lane-wise, distributing some scalar operator (e.g. multiplication) across the lanes of the participating vectors, usually generating a vector result whose lanes contain the various individual scalar results. 

Fig: Scalar Vs Vector Operations


When run on a supporting CPU architecture, lane-wise operations can be executed in parallel by the hardware, typically using a single CPU instruction per vector operation.

This style of parallelism is called Single Instruction Multiple Data (SIMD) parallelism. 
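As a small sketch of lane-wise behavior, the example below multiplies two 4-lane vectors and then adds up the lanes of one vector. The class LanewiseDemo and its methods are hypothetical names for this illustration, and the fixed SPECIES_256 shape is chosen so that a vector of doubles always has exactly 4 lanes.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

// Compile and run with: --add-modules jdk.incubator.vector
class LanewiseDemo {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_256; // 4 double lanes

    // Lane-wise multiply: lane i of the result is a[i] * b[i].
    static double[] multiply(double[] a, double[] b) {
        DoubleVector va = DoubleVector.fromArray(SPECIES, a, 0);
        DoubleVector vb = DoubleVector.fromArray(SPECIES, b, 0);
        return va.mul(vb).toArray();
    }

    // Cross-lane reduction: adds all lanes of one vector into a single scalar.
    static double sumOfLanes(double[] a) {
        return DoubleVector.fromArray(SPECIES, a, 0).reduceLanes(VectorOperators.ADD);
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0, 4.0};
        double[] b = {10.0, 20.0, 30.0, 40.0};
        System.out.println(Arrays.toString(multiply(a, b))); // [10.0, 40.0, 90.0, 160.0]
        System.out.println(sumOfLanes(a));                   // 10.0
    }
}
```

Both arrays must hold at least SPECIES.length() elements here, because fromArray without a mask reads a full vector.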

If you don’t want certain elements of a Vector to participate in the computation, you can hide them behind a VectorMask, which is an ordered immutable sequence of boolean values. 

A vector mask has the same number of lanes as a Vector, and elements inside a Vector will not take part in a computation if the corresponding lane in the vector mask contains a false. 

That is, conditional execution of an operation on vector lanes can be achieved using masked operations, such as blend(), under the control of an associated VectorMask.

A VectorMask can also avoid wasted computation on inactive lanes. On hardware with native support for masking, it maps onto CPU SIMD instructions, and it replaces scalar conditional logic with branch-free vectorized masking.

The table below summarizes the utility of a VectorMask. 

Feature                   Role of VectorMask
Conditional execution     Applies operations to selected vector lanes only
Safer memory access       Prevents out-of-bounds vector access
Performance enhancement   Enables SIMD acceleration with reduced branching
Functional clarity        Expresses logic in a clean, branch-free, vector-friendly way

Fig: VectorMask Feature Summary
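The first two rows of the table can be sketched in a few lines. The hypothetical method clampNegativesToZero below (our name, for illustration only) uses one mask from indexInRange to stay inside the array bounds and a second mask from compare to pick the lanes that blend() should overwrite.

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;
import java.util.Arrays;

// Compile and run with: --add-modules jdk.incubator.vector
class MaskedClamp {
    static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_256; // 4 double lanes

    // Replaces every negative element with 0.0 without a per-element if-statement.
    static double[] clampNegativesToZero(double[] in) {
        double[] out = new double[in.length];
        for (int i = 0; i < in.length; i += SPECIES.length()) {
            // True only for lanes whose index i + lane is still inside the array.
            VectorMask<Double> inRange = SPECIES.indexInRange(i, in.length);
            DoubleVector v = DoubleVector.fromArray(SPECIES, in, i, inRange);

            // True for lanes holding a negative value.
            VectorMask<Double> negative = v.compare(VectorOperators.LT, 0.0);

            // Lanes where the mask is true become 0.0; the others keep their value.
            v.blend(0.0, negative).intoArray(out, i, inRange);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] data = {-1.5, 2.0, -0.25, 4.0, 5.0, -6.0};
        System.out.println(Arrays.toString(clampNegativesToZero(data)));
        // [0.0, 2.0, 0.0, 4.0, 5.0, 0.0]
    }
}
```

The input has 6 elements but a vector has 4 lanes, so the second iteration would read past the end of the array without the inRange mask.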

Let’s repeat the above program, and try to execute the same trigonometric operations with Vector computations. 

// Compile and run with: --add-modules jdk.incubator.vector
import jdk.incubator.vector.*;

public class TrigonometricLanewiseVectorComputations {

    public static void main(String[] args) {
        double[] angles = new double[20];
        for (int i = 0; i < angles.length; i++) {
            angles[i] = i * Math.PI / 20; // 0, π/20, π/10, ...
        }

        VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_256; // 256-bit vectors: 4 double lanes

        double[] sinResults = new double[angles.length];
        double[] cosResults = new double[angles.length];
        double[] atanResults = new double[angles.length];

        long start = System.nanoTime();
	
        for (int i = 0; i < angles.length; i += SPECIES.length()) {
            var mask = SPECIES.indexInRange(i, angles.length);
            var angleVec = DoubleVector.fromArray(SPECIES, angles, i, mask);

            // Perform sin, cos, atan elementwise (lanewise)
            var sinVec = angleVec.lanewise(VectorOperators.SIN, mask);
            var cosVec = angleVec.lanewise(VectorOperators.COS, mask);
            var atanVec = angleVec.lanewise(VectorOperators.ATAN, mask);

            sinVec.intoArray(sinResults, i, mask);
            cosVec.intoArray(cosResults, i, mask);
            atanVec.intoArray(atanResults, i, mask);
        }
		
        long end = System.nanoTime();
        System.out.println("Vector computation Time (ms): " + (end - start) / 1_000_000.0);
		
        // Print the results
        System.out.printf("%-10s %-10s %-10s %-10s%n", "Angle", "Sin", "Cos", "Atan");
        for (int i = 0; i < angles.length; i++) {
            System.out.printf("%-10.5f %-10.5f %-10.5f %-10.5f%n",
                    angles[i], sinResults[i], cosResults[i], atanResults[i]);
        }
    }
}

These results are specific to my operating system and computer, and yours may vary based on your setup. In my runs, the vector version took roughly half the time of the scalar version. The graph below depicts the performance benefits achieved.


Fig: Graphical Representation of Benchmark Results
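A single System.nanoTime() measurement around the first execution of a loop also counts class loading and the time spent before the JIT compiler has optimized the code, and with only 20 elements that overhead can dominate. A minimal sketch of a steadier measurement, assuming a hypothetical helper averageMillis that is not part of the programs above, warms the code up first and then averages many timed runs over a larger array. For serious benchmarking, a dedicated harness such as JMH is the usual choice.

```java
class WarmedUpTiming {

    // Runs the task warmupRuns times untimed, then returns the average
    // duration of timedRuns further runs, in milliseconds.
    static double averageMillis(Runnable task, int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            task.run();
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            task.run();
        }
        long elapsed = System.nanoTime() - start;
        return elapsed / 1_000_000.0 / timedRuns;
    }

    public static void main(String[] args) {
        // A larger input than 20 elements, so the work outweighs the timer overhead.
        double[] angles = new double[1_000_000];
        double[] results = new double[angles.length];
        for (int i = 0; i < angles.length; i++) {
            angles[i] = i * Math.PI / angles.length;
        }

        Runnable scalarSin = () -> {
            for (int i = 0; i < angles.length; i++) {
                results[i] = Math.sin(angles[i]);
            }
        };

        System.out.printf("Scalar sin, average per run (ms): %.3f%n",
                averageMillis(scalarSin, 20, 50));
    }
}
```

The same helper can time the vector loop by passing it as another Runnable, which puts the two versions on an equal footing.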

What Are the Tangible Benefits of Vector Computations in Modern Computing? 

There are many benefits of opting for vector computations. A few of them are:

  • Data-heavy and computation-intensive applications, especially in the fields of artificial intelligence, machine learning, scientific computing, and analytics, gain in performance.
  • Workloads built around vector databases, such as similarity search and finding relationships between data points based on their semantic meaning, rely on distance computations over numeric arrays, which the Vector API can accelerate.
  • The underlying CPU and hardware resources are used more effectively.
  • Because the work finishes in fewer CPU cycles, other processes and applications get a bigger share of computing resources, which helps keep your application responsive and nimble.

Conclusion

We have demonstrated the performance benefits that can be achieved through Vector computations, and how these enhancements can help program modern Java applications. 
