Spring Batch: Building robust processing jobs for text files

If you’re a software engineer or an early-career software architect looking to process large volumes of data by building an ETL pipeline in Java, and you need features like chunk-based processing, restartability, retry, parallelism, transactions, and job scheduling, Spring Batch proves to be an invaluable tool. Spring Batch is a lightweight yet robust batch processing framework from the Spring ecosystem, designed specifically for building enterprise-grade batch processing applications in Java. Whether we’re processing millions of records, running ETL pipelines, or executing nightly batch jobs, Spring Batch handles the heavy lifting under the hood.

Spring Batch maintains job execution state in database tables, which is why a database is mandatory even for simple batch jobs. The framework creates six primary tables and three sequences, which we cover in detail later in this article. Understanding these auto-generated tables is critical for troubleshooting and monitoring our batch jobs: the metadata stored in them allows Spring Batch to track job statuses, handle restarts, and ensure exactly-once processing.

Since Spring Batch jobs typically run in high-load environments, memory and thread issues can find their way into your batch pipelines. Suppose your batch job starts normally but slows down halfway through execution. This could be caused by blocked worker threads, deadlocks, or long GC events. In such circumstances, software engineers typically analyze the internal state of the application using JVM artifacts such as heap dumps, garbage collection logs, and thread dumps. External diagnostic tools like GCeasy (for GC log analysis), HeapHero (for memory leak detection), and fastThread (for inspecting thread dumps), or a unified tool like yCrash, can quickly pinpoint issues such as memory leaks resulting in OutOfMemoryError, objects consuming excessive heap space, thread contention, and excessive garbage collection.

Understanding the nuts and bolts of Spring Batch

What is a Spring Batch Job

A Spring Batch job consists of one or more steps, which can execute either sequentially or in parallel. The job creation process uses the JobBuilder API and a JobRepository (similar to a JPA repository, if you’re familiar with JPA), which serves as the data access layer for batch metadata. Steps are created using the StepBuilder, which similarly requires a name and the JobRepository. For our first example, we use a Tasklet, which is ideal for simple, non-repetitive tasks.

    @Bean
    public Job firstJob(JobRepository jobRepo, Step step1, Step step2) {
        return new JobBuilder("My first Spring Batch Job", jobRepo)
                .start(step1)
                .next(step2) // step2 would be another @Bean, defined just like step1
                .build();
    }

    @Bean
    public Step step1(JobRepository jobRepo, PlatformTransactionManager platformTransactionMgr) {
        return new StepBuilder("step1", jobRepo)
                .tasklet((contribution, context) -> {
                    log.info("Tasklet got executed");
                    return RepeatStatus.FINISHED;
                }, platformTransactionMgr)
                .build();
    }

A tasklet is perfect for tasks that don’t follow the read-process-write pattern, such as printing a message, executing a single database operation, or performing cleanup activities. In the above code snippet the tasklet is defined as a lambda, which executes once and signals completion by returning RepeatStatus.FINISHED. Once we define both the job and its steps as Spring beans, our minimal Spring Batch job configuration is complete. When we run the application, we’ll see log messages in the console indicating that the job launched, displaying its metadata (such as the empty parameter set), the step executing in a few milliseconds, and the job completion status. This simple example establishes the foundation for the more complex batch processing scenarios we will see later in this article.

Controlling the Spring Batch Job Execution Programmatically

By default, Spring Batch automatically launches all configured jobs when the application starts. This behavior isn’t always desirable. To gain programmatic control over job execution, we can implement the CommandLineRunner interface in our main class, which executes after the application context loads (code follows below). More importantly, we add the property `spring.batch.job.enabled=false` to our application configuration. This crucial setting instructs Spring Batch not to run jobs automatically on startup, giving us complete control over job execution timing. The CommandLineRunner approach is a simple starting point; more sophisticated triggering mechanisms use REST APIs and/or schedulers.
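For reference, the setting lives in application.properties (or the equivalent YAML):

    # application.properties
    # Prevent Spring Batch from auto-launching configured jobs at startup
    spring.batch.job.enabled=false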

Passing job parameters to a Spring Batch Job

Ensuring Unique Job Instances with Job Parameters

Job parameters are immutable key-value pairs that Spring Batch uses to determine whether a job instance has already been executed. If we attempt to run a job with exactly the same parameters as one of its previous executions, Spring Batch blocks the launch, ensuring that each unique parameter set represents a distinct job instance.

To define our job parameters, we use the JobParametersBuilder API, which follows the builder design pattern like most of the other APIs in Spring Batch.

    @Bean
    public CommandLineRunner executesAtStartUp(Job employeeJob, JobLauncher jobLauncher) {
        return args -> {
            JobParametersBuilder jpb = new JobParametersBuilder();
            // attach key-value pairs
            jpb.addJobParameter("dept_name", "xyz_dept", String.class);
            jpb.addJobParameter("dept_id", 123, Integer.class);
            jpb.addJobParameter("jobRun", LocalDateTime.now(), LocalDateTime.class);
            jobLauncher.run(employeeJob, jpb.toJobParameters());
        };
    }

If we run the same job again with identical parameters, we will see a JobInstanceAlreadyCompleteException in the console:

org.springframework.batch.core.repository.JobInstanceAlreadyCompleteException: A job instance already exists and is complete for parameters={}. If you want to run this job again, change the parameters.

Thus Spring Batch prevents duplicate executions. To enable repeated job runs during development, a simple hack would be to include a timestamp parameter using LocalDateTime.now(), ensuring each execution has unique parameters. 

Understanding the Spring Batch metadata tables

State management is powered by relational database tables, although I noticed that support for MongoDB, a NoSQL database, was added about six months before this article was written. The metadata provides powerful capabilities but requires some understanding to avoid common pitfalls. Using these tables, the framework prevents duplicate processing of the same logical job instance, ensuring exactly-once semantics. The relationship between a job instance, its execution(s), and step executions is well defined and hierarchical: one job instance can have multiple executions or runs (in case of failures and restarts), and each execution is linked to multiple step executions.

BATCH_JOB_EXECUTION

This table contains rows for each job execution, with columns for job execution ID, job instance ID, start time, end time, status, and exit code.  

BATCH_JOB_INSTANCE

This table stores job instances: each unique combination of job name and identifying parameters creates a new instance. The job instance ID is referenced as a foreign key from the job execution table. If we run the same job multiple times with different parameters, we’ll see multiple instances, each potentially having one or more execution attempts.

BATCH_STEP_EXECUTION

This table tracks each step’s execution within a job. It includes the step name, the job execution ID (a foreign key to the job execution table), start and end times, status, commit count, read count, write count, and filter count. The metrics in this table provide insight into our batch processing performance and help identify bottlenecks. If we run into performance issues, we can periodically query these tables to calculate the overall throughput of the system (see the sketch below). This means writing a few moderately complex SQL queries involving joins, but it helps us understand the root cause of performance degradation and latency. More often than not, in my experience, long GC pauses slow down batch jobs.
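As a rough illustration, here is a sketch of the kind of query we might run periodically via Spring’s JdbcTemplate. The table and column names follow the default Spring Batch schema, but the BatchThroughputReporter class and the exact SQL below are illustrative, not something the framework ships with:

    import java.time.Duration;
    import org.springframework.jdbc.core.JdbcTemplate;

    public class BatchThroughputReporter {

        private final JdbcTemplate jdbcTemplate;

        public BatchThroughputReporter(JdbcTemplate jdbcTemplate) {
            this.jdbcTemplate = jdbcTemplate;
        }

        public void printStepThroughput(String jobName) {
            // Join step executions back to their job instance to compute items/second per step
            String sql = """
                SELECT s.STEP_NAME, s.WRITE_COUNT, s.START_TIME, s.END_TIME
                FROM BATCH_STEP_EXECUTION s
                JOIN BATCH_JOB_EXECUTION e ON s.JOB_EXECUTION_ID = e.JOB_EXECUTION_ID
                JOIN BATCH_JOB_INSTANCE i ON e.JOB_INSTANCE_ID = i.JOB_INSTANCE_ID
                WHERE i.JOB_NAME = ? AND s.END_TIME IS NOT NULL
                """;
            jdbcTemplate.query(sql, rs -> {
                long seconds = Duration.between(
                        rs.getTimestamp("START_TIME").toLocalDateTime(),
                        rs.getTimestamp("END_TIME").toLocalDateTime()).toSeconds();
                long written = rs.getLong("WRITE_COUNT");
                double throughput = seconds == 0 ? written : (double) written / seconds;
                System.out.printf("%s: %d items written in %d s (~%.1f items/s)%n",
                        rs.getString("STEP_NAME"), written, seconds, throughput);
            }, jobName);
        }
    }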

Here’s a good read on troubleshooting GC pause related issues: Diagnosing and Resolving Garbage Collection Pauses in the ServiceNow MID server

The Job Parameters and Context Tables

A few additional tables include BATCH_JOB_EXECUTION_PARAMS, which stores job parameters, and BATCH_JOB_EXECUTION_CONTEXT and BATCH_STEP_EXECUTION_CONTEXT, which maintain the state information that enables restart capabilities.

Spring Batch Case Study: Building a complete data ETL Pipeline

This case study demonstrates how to build a complete Spring Batch pipeline and teaches the fundamentals of creating custom ItemReaders, ItemProcessors, and ItemWriters. We assume we have a sample CSV containing 5 employee records with fields for ID, name, location, and email address.

Consider this mock data as the CSV:

1,Bill,Glasgow,Bill@gmail.com
2,James,Mumbai,James@gmail.com
3,Joe,Glasgow,joe@gmail.com
4,Toby,London,Toby@gmail.com
5,Henry,Glasgow,Henry@gmail.com

The mock business requirements for this case study are simple and clear:

  • Read all the employee data from the CSV file
  • Filter out all employees based in Glasgow
  • Migrate all employee email domains from “gmail” to “yahoo”
  • Write the remaining employee records to the console in JSON format

We are taking baby steps here, and this seemingly simple task introduces us to the core components of Spring Batch’s chunk-oriented processing model. While this case study lays the foundation using custom implementations, it’s important to note that Spring Batch provides dozens of built-in readers and writers for common scenarios. The custom approach is chosen intentionally to help us get acquainted with how these components work under the hood. In an enterprise setting, nine times out of ten you would use Spring Batch’s provided implementations.

Creating a custom ItemReader in Spring Batch

Before we write our own reader and plug it into our job definition, we define the Employee model class, whose fields match the columns in the CSV file exactly. The method we need to implement in the reader is `read()`, which Spring Batch calls repeatedly until it returns null, signaling the end of data. A null return value tells Spring Batch that the reading phase is complete and that it should finalize the chunk and write the accumulated items.
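For reference, a minimal Employee model might look like the following sketch (the field names are assumed to mirror the CSV columns):

    public class Employee {

        private long id;
        private String name;
        private String location;
        private String email;

        public Employee(long id, String name, String location, String email) {
            this.id = id;
            this.name = name;
            this.location = location;
            this.email = email;
        }

        // getters and setters omitted for brevity
    }

The `read()` implementation of our custom reader then looks like this: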

    // `lines` holds the file content loaded up front; `count` tracks the current position
    @Override
    public Employee read() throws Exception {
        if (count < lines.size()) {
            String line = lines.get(count++);
            String[] columnValues = line.split(",");
            long id = Long.parseLong(columnValues[0].trim());
            String name = columnValues[1].trim();
            String location = columnValues[2].trim();
            String email = columnValues[3].trim();
            return new Employee(id, name, location, email);
        }
        return null; // EOF: tells Spring Batch the reading phase is complete
    }

While this implementation is not production grade and lacks exception handling, it is fine for learning purposes. It does have limitations: hard-coding column indices and delimiters isn’t maintainable for real applications, and storing all lines in memory isn’t suitable for large files. These concerns are addressed later when we look at Spring Batch’s built-in flat file readers, which handle them elegantly through configuration.

Creating a Custom ItemWriter for JSON Output

The ItemWriter receives chunks of employee objects and writes them to the output destination. For writing JSON output, we use Jackson’s ObjectMapper, which comes with the Spring Web dependency. The Chunk parameter represents a batch of items that were successfully read and processed. Spring Batch determines the chunk size through configuration in the step definition. For example, with a chunk size of 3, the writer receives three items at a time (or fewer when approaching the end of data). This chunking mechanism provides transaction boundaries, allowing Spring Batch to commit work in manageable pieces and enabling restarts.
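A minimal sketch of such a writer might look like this (the class name EmployeeJsonWriter is illustrative; it assumes Jackson’s ObjectMapper is on the classpath and simply prints each item to the console):

    @Component
    public class EmployeeJsonWriter implements ItemWriter<Employee> {

        private final ObjectMapper objectMapper = new ObjectMapper();

        @Override
        public void write(Chunk<? extends Employee> chunk) throws Exception {
            // Invoked once per chunk, e.g. three items at a time for a chunk size of 3
            for (Employee employee : chunk) {
                System.out.println(objectMapper.writeValueAsString(employee));
            }
        }
    }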

Creating a Custom ItemProcessor for Filtering and Transformation

The ItemProcessor sits between the reader and writer, applying business logic to transform or filter items. The process method receives an employee object and returns either the modified employee or null to filter it out.

When the processor returns null, Spring Batch excludes that item from the chunk being written.

It’s important to note that the process method executes for each item individually, unlike the write method, which operates on entire chunks. This design allows fine-grained control over item processing. The processor can also return items with modified fields, enabling data enrichment, format conversion, or the application of business rules.
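A sketch of such a filtering processor might look like this (the class name EmployeeLocationFilterProcessor is illustrative):

    @Component
    public class EmployeeLocationFilterProcessor implements ItemProcessor<Employee, Employee> {

        @Override
        public Employee process(final Employee employee) throws Exception {
            // Returning null drops the item; it will never reach the writer
            if ("Glasgow".equalsIgnoreCase(employee.getLocation())) {
                return null;
            }
            return employee;
        }
    }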

After implementing the reader, writer, and processor, we need to wire them into our step definition, as shown in the sketch below. With this done, we’ll see Spring Batch process the data, filter it based on location, and output the results in JSON format.
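The wiring might look roughly like this (a sketch; the bean names, the chunk size of 3, and the custom reader/processor/writer class names are assumptions carried over from our example):

    @Bean
    public Step employeeStep(JobRepository jobRepo,
                             PlatformTransactionManager transactionManager,
                             EmployeeItemReader reader,
                             EmployeeLocationFilterProcessor processor,
                             EmployeeJsonWriter writer) {
        return new StepBuilder("employeeStep", jobRepo)
                .<Employee, Employee>chunk(3, transactionManager) // items are read one at a time, written three at a time
                .reader(reader)
                .processor(processor)
                .writer(writer)
                .build();
    }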

Chaining ItemProcessors Using CompositeItemProcessor in Spring Batch

In real-world scenarios, we often need to apply multiple processing layers to our data. Spring Batch supports chaining multiple processors together under one umbrella using the CompositeItemProcessor. This allows us to separate concerns, making our code more testable and maintainable. For this example, we’ll introduce a second processor that transforms email addresses: converting them to lowercase and migrating each employee’s domain from “gmail” to “yahoo”, as per the requirements of our problem statement.

In the EmployeeEmailProcessor’s process() method, we write the required logic as:

@Component
public class EmployeeEmailProcessor implements ItemProcessor<Employee, EmployeeProcessed> {

    @Override
    public EmployeeProcessed process(final Employee employee) throws Exception {
        EmployeeProcessed employeeProcessed = new EmployeeProcessed();
        employeeProcessed.setId(employee.getId());
        employeeProcessed.setName(employee.getName());
        employeeProcessed.setLocation(employee.getLocation());
        employeeProcessed.setEmail(transformEmail(employee.getEmail()));
        return employeeProcessed;
    }

    private String transformEmail(final String email) {
        // migrate the domain from gmail to yahoo & convert to lowercase
        return email.toLowerCase().replace("gmail", "yahoo");
    }
}

To chain the processors, we define a bean of type CompositeItemProcessor in our job configuration, as shown in the sketch below. The first processor filters employees by location, and its output becomes the input to the second processor, which transforms email domains. When we run the application, we see both processors working in tandem: first filtering out Glasgow-based employees, then transforming the emails of the remaining records.
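A sketch of that bean definition, under the same naming assumptions as before:

    @Bean
    public CompositeItemProcessor<Employee, EmployeeProcessed> employeeCompositeProcessor(
            EmployeeLocationFilterProcessor locationFilter,
            EmployeeEmailProcessor emailProcessor) {
        CompositeItemProcessor<Employee, EmployeeProcessed> composite = new CompositeItemProcessor<>();
        // The output of the location filter feeds into the email processor
        composite.setDelegates(List.of(locationFilter, emailProcessor));
        return composite;
    }

With the composite in place, the step’s output item type (and therefore the writer’s item type) becomes EmployeeProcessed.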

How to Read Data from Multiple Sources in Spring Batch

While custom item readers help you understand the fundamentals, Spring Batch provides robust built-in readers for common scenarios. You will find yourself using them at work nine out of ten times, although if your use case or data format is non-standard, you may have to code your own custom reader.

Reading CSV Files Using FlatFileItemReader

The FlatFileItemReader is designed specifically for reading flat files, such as delimiter-separated text. The reader requires several properties: the number of lines to skip (useful if your file has a header), the resource (file path), and a line mapper that converts each line to your domain object. For the line mapper, we can use DefaultLineMapper, which requires a line tokenizer and a field set mapper.
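A configuration sketch, assuming the CSV shown earlier is on the classpath as employees.csv and that Employee has a no-argument constructor and setters (required by BeanWrapperFieldSetMapper):

    @Bean
    public FlatFileItemReader<Employee> csvEmployeeReader() {
        FlatFileItemReader<Employee> reader = new FlatFileItemReader<>();
        reader.setResource(new ClassPathResource("employees.csv")); // assumed file location
        reader.setLinesToSkip(0); // our sample file has no header row

        DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer(",");
        tokenizer.setNames("id", "name", "location", "email");

        BeanWrapperFieldSetMapper<Employee> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Employee.class);

        DefaultLineMapper<Employee> lineMapper = new DefaultLineMapper<>();
        lineMapper.setLineTokenizer(tokenizer);
        lineMapper.setFieldSetMapper(fieldSetMapper);
        reader.setLineMapper(lineMapper);
        return reader;
    }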

Reading Excel Files Using Apache POI 

Excel files are common in enterprise environments. Spring Batch doesn’t include an Excel reader out of the box, but integrating the Apache POI library provides this functionality. We have two options for row mapping. The first is to create a custom EmployeeRowMapper (if you have worked with JDBC before, this will sound familiar). The second uses Spring’s BeanWrapperRowMapper, which automatically maps Excel columns to object properties.
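As a rough sketch of the custom approach, here is a reader built directly on Apache POI (the class name, file name, and column order are assumptions; it loads the whole sheet eagerly, so a streaming approach would be preferable for large spreadsheets):

    @Component
    public class EmployeeExcelReader implements ItemReader<Employee> {

        private List<Employee> employees;
        private int index = 0;

        @Override
        public Employee read() throws Exception {
            if (employees == null) {
                employees = loadWorkbook(); // load lazily on the first read() call
            }
            return index < employees.size() ? employees.get(index++) : null; // null signals end of data
        }

        private List<Employee> loadWorkbook() throws Exception {
            List<Employee> result = new ArrayList<>();
            try (Workbook workbook = WorkbookFactory.create(new File("employees.xlsx"))) {
                Sheet sheet = workbook.getSheetAt(0);
                for (Row row : sheet) {
                    result.add(new Employee(
                            (long) row.getCell(0).getNumericCellValue(),
                            row.getCell(1).getStringCellValue(),
                            row.getCell(2).getStringCellValue(),
                            row.getCell(3).getStringCellValue()));
                }
            }
            return result;
        }
    }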

Conclusion: Building reliable and scalable ETL jobs with Spring Batch

In this article we explored how to build complete batch processing pipelines with Spring Batch. We learned to create jobs and steps, build data pipelines, and work with common data formats like CSV and Excel. We coded our own custom implementations and then explored Spring Batch’s built-in components, which demonstrates both the framework’s flexibility and its comprehensive feature set. As you continue your Spring Batch journey, you should explore additional topics like parallel processing, partitioning, job orchestration, and integration with Spring Cloud Task for cloud-native batch processing. The patterns and practices covered here provide a solid foundation for building enterprise-grade batch applications that process millions of records reliably and efficiently. As enterprises build ETL jobs with Spring Batch, combining robust processing pipelines with observability tools such as yCrash, GCeasy, fastThread, and HeapHero can open the door to automated production diagnostics and better throughput.
