Slurm is a natural fit for large jobs over the Wikipedia Edit Blocks dataset: you can fan processing out across many nodes while keeping each task focused on a manageable slice of the data.
The example below shows a simple Slurm batch script that runs a Python script inside a conda environment with BloArk installed:
```bash
#!/bin/bash
#SBATCH --job-name=wikidata-job
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

source ~/.bashrc
conda activate wikidata-env

python process_wikidata.py --data-dir ./wikidata-edit-blocks
```
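Assuming the script above is saved as `run_wikidata.slurm` (the filename is just an example), it can be submitted and monitored with the standard Slurm commands:

```bash
# Submit the batch script; Slurm prints the assigned job ID.
sbatch run_wikidata.slurm

# Check the job's state while it is queued or running.
squeue -u "$USER"

# Inspect accounting information after the job finishes (replace <jobid>).
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS
```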
You can extend this pattern to array jobs that process different slices of the dataset by passing additional arguments (for example, a shard index) into your Python script.
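As a rough sketch of that pattern, the array-job variant below passes the Slurm array index to the script as a shard index. The `--shard-index` and `--num-shards` flags are hypothetical and would need matching logic inside `process_wikidata.py`; the array size of 16 is likewise just an example.

```bash
#!/bin/bash
#SBATCH --job-name=wikidata-array
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --array=0-15

source ~/.bashrc
conda activate wikidata-env

# SLURM_ARRAY_TASK_ID identifies this task within the array (0..15 here).
# --shard-index / --num-shards are hypothetical flags your script would handle.
python process_wikidata.py \
    --data-dir ./wikidata-edit-blocks \
    --shard-index "${SLURM_ARRAY_TASK_ID}" \
    --num-shards 16
```

Inside the script, each task can then select, say, every 16th file starting at its shard index, so the array as a whole covers the dataset without overlap.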