Getting Started

Prepare your environment

Terminal requirements

Be comfortable working in a terminal. Linux or macOS is preferred because all commands in this documentation assume a POSIX shell. The same workflow runs under PowerShell on Windows, but some steps may look slightly different.

Before starting any task, prepare your local environment:

  • A dedicated Python environment manager such as conda. Other tools such as venv or virtualenv also work. We recommend Miniforge because it supports Apple Silicon well.
  • Python 3.9 or newer. Create a fresh environment for this work; versions older than 3.9 may not be fully supported (see the example after this list).
  • Optional access to a Slurm cluster (for example, UMass Swarm or UMass Unity) if you plan to run large-scale preprocessing jobs.
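
For example, with conda or Miniforge you might create and activate a fresh environment like this (a sketch; the environment name is arbitrary):

Bash

conda create -n edit-blocks python=3.11
conda activate edit-blocks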

We will use a pinned version of BloArk in the examples. You can experiment with newer versions, but if you run into issues, roll back to the version documented here and consult the package changelog.

Bash

pip install "bloark==2.3.3"
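
To confirm that the pinned version is installed, you can inspect the package metadata with pip:

Bash

pip show bloark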

Download the dataset

The Wikipedia Edit Blocks dataset is hosted on HuggingFace as a standard dataset. You download it once to a local directory, then point your BloArk pipelines at that directory for all downstream processing.

Use the HuggingFace Hub Python client to pull the dataset locally:

Bash

pip install "huggingface_hub>=0.24.0"

Python

from huggingface_hub import snapshot_download

local_dir = "./wikidata-edit-blocks"

snapshot_download(
    # Replace with the dataset ID on HuggingFace.
    repo_id="<dataset-id>",
    repo_type="dataset",
    local_dir=local_dir,
)

Check the dataset page on HuggingFace for the exact repo_id and any additional instructions provided by the maintainers.
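
If you prefer the command line, the huggingface-cli tool that ships with huggingface_hub can perform the same download; substitute the real dataset ID for the placeholder:

Bash

huggingface-cli download <dataset-id> --repo-type dataset --local-dir ./wikidata-edit-blocks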

Tip

If you are working on a shared cluster, download the dataset into a shared scratch or network-mounted directory so multiple jobs can reuse the same files without re-downloading.
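
For example, you might point the HuggingFace cache at shared storage via the HF_HOME environment variable and set local_dir in the snippet above to a shared path. The paths below are hypothetical; use your cluster's scratch or project directory:

Bash

# Hypothetical shared location; substitute your cluster's scratch or project directory.
export HF_HOME=/path/to/shared/scratch/hf-cache

# Run the download snippet above, saved as a script (the file name is arbitrary).
python download_edit_blocks.py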

Open the dataset

After downloading the dataset, you interact with it through BloArk. A BloArk instance gives you a simple iterator over blocks, so you can focus on your research logic instead of file layouts.

Python

from bloark import BloArk

dataset = BloArk(data_path="./wikidata-edit-blocks")

for block in dataset:
    # Each block contains a slice of the edit history.
    # Add your own processing logic here.
    ...

You can wrap this pattern in your own utilities — for example, to filter by namespace, article ID, or snapshot date — without worrying about how files are laid out on disk.
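
As a minimal sketch, a namespace filter wrapped around the iterator might look like the following. The block field accessed here (a "namespace" entry) is an assumption for illustration; adapt it to the actual block schema described in the BloArk documentation:

Python

from bloark import BloArk

def iter_blocks_in_namespace(data_path, namespace=0):
    """Yield only blocks that belong to the given namespace (field name assumed)."""
    dataset = BloArk(data_path=data_path)
    for block in dataset:
        # Hypothetical field access; replace with the real block schema.
        if block.get("namespace") == namespace:
            yield block

for block in iter_blocks_in_namespace("./wikidata-edit-blocks"):
    ...  # your processing logic here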

Build a derived dataset

Most projects do not use the full Wikipedia Edit Blocks dataset directly. Instead, you usually build a smaller derived dataset that matches your task — for example, a subset of articles, a time slice, or a filtered set of revisions.

Common patterns:

  • Filter by language, namespace, or page ID to keep only the slices you care about.
  • Subsample articles or revisions to create smaller debugging datasets.
  • Add derived features such as time between edits, editor activity statistics, or labels for downstream models.

At a high level, you:

  1. Open the original dataset with BloArk.
  2. Iterate over blocks and apply your own filter or transformation.
  3. Write the transformed blocks into a new BloArk dataset on disk.
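
For example, a subsampling pass (step 2) that keeps every Nth block to build a small debugging dataset could look like the sketch below. The write step is left as a placeholder because the exact writer API depends on your BloArk version; consult its documentation for how to emit blocks into a new dataset:

Python

import itertools

from bloark import BloArk

def subsample_blocks(source_path, keep_every=100):
    """Yield every keep_every-th block to build a small debugging subset."""
    dataset = BloArk(data_path=source_path)
    return itertools.islice(dataset, 0, None, keep_every)

for block in subsample_blocks("./wikidata-edit-blocks"):
    # Step 3: write each kept block into a new BloArk dataset on disk.
    # The writer API depends on your BloArk version; see its documentation.
    ...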

Keeping the derived dataset in BloArk format makes it easy to reuse your work and share it with collaborators.
