Slurm is a natural fit for large jobs over the Wikipedia Edit Blocks dataset: you can fan processing out across many nodes while keeping each task focused on a manageable slice of the data.
The example below shows a simple Slurm batch script that runs a Python script inside a conda environment with BloArk installed:
```bash
#!/bin/bash
#SBATCH --job-name=wikidata-job
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G

source ~/.bashrc
conda activate wikidata-env

python process_wikidata.py --data-dir ./wikidata-edit-blocks
```
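Assuming the script above is saved as `run_wikidata.slurm` (the filename is just an example), it can be submitted and monitored with the standard Slurm commands:

```bash
# Submit the batch script; Slurm prints the assigned job ID.
sbatch run_wikidata.slurm

# Check the job's state while it is queued or running.
squeue -u "$USER"

# Inspect accounting information after the job finishes (replace <jobid>).
sacct -j <jobid> --format=JobID,State,Elapsed,MaxRSS
```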
You can extend this pattern to array jobs that process different slices of the dataset by passing additional arguments (for example, a shard index) into your Python script.
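As a rough sketch of that pattern, the array-job variant below passes the Slurm array index to the script as a shard index. The `--shard-index` and `--num-shards` flags are hypothetical and would need matching logic inside `process_wikidata.py`; the array size of 16 is likewise just an example.

```bash
#!/bin/bash
#SBATCH --job-name=wikidata-array
#SBATCH --time=04:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem=16G
#SBATCH --array=0-15

source ~/.bashrc
conda activate wikidata-env

# SLURM_ARRAY_TASK_ID identifies this task within the array (0..15 here).
# --shard-index / --num-shards are hypothetical flags your script would handle.
python process_wikidata.py \
    --data-dir ./wikidata-edit-blocks \
    --shard-index "${SLURM_ARRAY_TASK_ID}" \
    --num-shards 16
```

Inside the script, each task can then select, say, every 16th file starting at its shard index, so the array as a whole covers the dataset without overlap.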