Spring Batch doesn’t process large datasets in one go; it orchestrates the reading, processing, and writing of data in manageable chunks.

Let’s say you have a massive CSV file of customer transactions, millions of rows, and you need to update a database with this data. Loading the entire file into memory is a non-starter. Spring Batch steps in by reading the file not all at once, but in batches of, say, 1000 records at a time. Each batch is handed off to your custom logic for processing (e.g., validating customer IDs, calculating totals) and finally written to the database. The key is that at any given moment, only a small portion of the total dataset is actively being worked on.

Here’s a peek at what that looks like in code:

@Configuration
@EnableBatchProcessing
public class BatchConfig {

    @Autowired
    private JobBuilderFactory jobBuilderFactory;

    @Autowired
    private StepBuilderFactory stepBuilderFactory;

    @Bean
    public FlatFileItemReader<Customer> customerItemReader() {
        FlatFileItemReader<Customer> reader = new FlatFileItemReader<>();
        reader.setResource(new ClassPathResource("customers.csv")); // Your input file

        // Map each CSV line to a Customer: split on commas, then bind fields by name.
        DefaultLineMapper<Customer> lineMapper = new DefaultLineMapper<>();

        DelimitedLineTokenizer tokenizer = new DelimitedLineTokenizer();
        tokenizer.setNames("id", "name", "email");
        lineMapper.setLineTokenizer(tokenizer);

        BeanWrapperFieldSetMapper<Customer> fieldSetMapper = new BeanWrapperFieldSetMapper<>();
        fieldSetMapper.setTargetType(Customer.class);
        lineMapper.setFieldSetMapper(fieldSetMapper);

        reader.setLineMapper(lineMapper);
        return reader;
    }

    @Bean
    public CustomerItemProcessor customerItemProcessor() {
        return new CustomerItemProcessor();
    }

    @Bean
    public JdbcBatchItemWriter<Customer> customerItemWriter() {
        JdbcBatchItemWriter<Customer> writer = new JdbcBatchItemWriter<>();
        writer.setDataSource(dataSource()); // Your JDBC DataSource
        writer.setSql("INSERT INTO CUSTOMER (ID, NAME, EMAIL) VALUES (:id, :name, :email)");
        writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider());
        return writer;
    }

    @Bean
    public Step importCustomerStep() {
        return stepBuilderFactory.get("importCustomerStep")
            .<Customer, Customer>chunk(100) // Process 100 records at a time
            .reader(customerItemReader())
            .processor(customerItemProcessor())
            .writer(customerItemWriter())
            .build();
    }

    @Bean
    public Job importCustomerJob() {
        return jobBuilderFactory.get("importCustomerJob")
            .incrementer(new RunIdIncrementer())
            .flow(importCustomerStep())
            .end()
            .build();
    }

    // DataSource configuration would go here
    // ...
}

This configuration sets up a job named importCustomerJob with a single step, importCustomerStep, which defines a chunk size of 100. Spring Batch will read 100 Customer objects from customers.csv using customerItemReader, run each one through customerItemProcessor (the processor handles items one at a time, not the whole chunk), and then hand the resulting chunk of up to 100 processed objects to customerItemWriter for database insertion in a single transaction. Once that chunk is committed, the cycle repeats for the next 100 records, and so on, until the file is exhausted.
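The CustomerItemProcessor itself isn’t shown above; a minimal sketch might look like the following, where the validation and normalization rules are purely illustrative assumptions, not something defined by the original config:

```java
import org.springframework.batch.item.ItemProcessor;

// Illustrative sketch: the actual business rules would be your own.
public class CustomerItemProcessor implements ItemProcessor<Customer, Customer> {

    @Override
    public Customer process(Customer customer) throws Exception {
        // Returning null filters the item out of the chunk,
        // so it never reaches the writer.
        if (customer.getEmail() == null || !customer.getEmail().contains("@")) {
            return null;
        }
        // Normalize data before it reaches the database.
        customer.setEmail(customer.getEmail().trim().toLowerCase());
        return customer;
    }
}
```

Note that returning null is Spring Batch’s built-in filtering mechanism; it is not treated as an error, and the item simply drops out of the chunk.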

The problem Spring Batch solves is precisely the memory and performance limitations of processing vast amounts of data in a single pass. By breaking the work into discrete, manageable chunks, it lets you process datasets that would otherwise be impossible to handle. Internally, it uses a persistent job repository (typically a relational database) to track the state of each job and step. This means that if your batch job crashes midway, it can be restarted from where it left off rather than starting from scratch. The chunk size is your primary lever for tuning performance: a larger chunk can improve throughput through fewer commit points and potentially more efficient I/O, but it also increases memory usage; a smaller chunk reduces the memory footprint but introduces more overhead from frequent commits.
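That restart behavior hinges on JobParameters: a job instance is identified by its parameters, and relaunching a failed instance with the same parameters resumes it from the last committed chunk. A minimal launch sketch, where the runner class and the parameter name are assumptions rather than part of the config above:

```java
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;

@Component
public class ImportRunner {

    @Autowired
    private JobLauncher jobLauncher;

    @Autowired
    private Job importCustomerJob;

    public void run() throws Exception {
        // These parameters identify the job instance. If a run with these
        // parameters fails, launching again with the same parameters resumes
        // the failed execution from the last committed chunk.
        JobParameters params = new JobParametersBuilder()
                .addString("inputFile", "customers.csv")
                .toJobParameters();
        jobLauncher.run(importCustomerJob, params);
    }
}
```

The RunIdIncrementer in the job definition only applies when you deliberately start the next instance of a completed job; a failed instance is restarted with its original parameters.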

Most people focus on the chunk size, but the real performance killer in large dataset processing is often the database write. JdbcBatchItemWriter is good, but for truly massive datasets, consider JpaItemWriter (or HibernateItemWriter) with JDBC batching enabled in your JPA configuration, or even more specialized writers that leverage the native bulk-insert capabilities of your database. Hibernate's hibernate.jdbc.batch_size setting, ideally combined with hibernate.order_inserts, is the critical knob here.
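As a hedged sketch of that swap, assuming Hibernate sits behind JPA and hibernate.jdbc.batch_size is set in your persistence configuration, the JDBC writer above could be replaced by a JpaItemWriter bean:

```java
import javax.persistence.EntityManagerFactory;

import org.springframework.batch.item.database.JpaItemWriter;
import org.springframework.context.annotation.Bean;

// Sketch only: assumes Customer is a JPA entity and that
// hibernate.jdbc.batch_size (e.g. 100) is set in the JPA properties,
// so Hibernate groups the chunk's inserts into JDBC batches at flush time.
@Bean
public JpaItemWriter<Customer> customerJpaItemWriter(EntityManagerFactory entityManagerFactory) {
    JpaItemWriter<Customer> writer = new JpaItemWriter<>();
    writer.setEntityManagerFactory(entityManagerFactory);
    return writer;
}
```

Wiring this bean into the step in place of customerItemWriter keeps the chunk flow identical; only the write mechanism changes.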

The next hurdle for large datasets is often handling exceptions within a chunk.

Want structured learning?

Take the full Spring Boot course →