Terminal requirements
Be comfortable working in a terminal. Linux or macOS is preferred because all commands in this documentation assume a POSIX shell. PowerShell on Windows can run the same workflow, but some steps may look different.
Before starting any task, prepare your local environment:
We will use a pinned version of BloArk in examples. You can experiment with newer versions, but if you hit issues, roll back to the version documented here and consult the package changelog.
```bash
pip install "bloark==2.3.3"
```
The Wikipedia Edit Blocks dataset is hosted on HuggingFace as a standard dataset. You download it once to a local directory, then point your BloArk pipelines at that directory for all downstream processing.
Use the HuggingFace Hub Python client to pull the dataset locally:
```bash
pip install "huggingface_hub>=0.24.0"
```
```python
from huggingface_hub import snapshot_download

local_dir = "./wikidata-edit-blocks"

snapshot_download(
    # Replace with the dataset ID on HuggingFace.
    repo_id="<dataset-id>",
    repo_type="dataset",
    local_dir=local_dir,
)
```
Check the dataset page on HuggingFace for the exact repo_id and any additional instructions provided by the maintainers.
Tip
If you are working on a shared cluster, download the dataset into a shared scratch or network-mounted directory so multiple jobs can reuse the same files without re-downloading.
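For example, you can point the Hub cache at a shared location by setting the `HF_HOME` environment variable before downloading (the path below is hypothetical; substitute your cluster's shared scratch directory):

```bash
# Hypothetical shared path; adjust for your cluster.
export HF_HOME=/shared/scratch/hf-cache
```

With this set, `snapshot_download` and other `huggingface_hub` calls store their cache under the shared path, so subsequent jobs reuse the same files.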
After downloading the dataset, you interact with it through BloArk. A BloArk instance gives you a simple iterator over blocks, so you can focus on your research logic instead of file layouts.
```python
from bloark import BloArk

dataset = BloArk(data_path="./wikidata-edit-blocks")

for block in dataset:
    # Each block contains a slice of the edit history.
    # Add your own processing logic here.
    ...
```
You can wrap this pattern in your own utilities — for example, to filter by namespace, article ID, or snapshot date — without worrying about how files are laid out on disk.
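As a sketch of such a utility, the generator below filters any iterable of blocks with a caller-supplied predicate. It makes no assumptions about BloArk's block type; the attribute you test inside the predicate (namespace, article ID, date, etc.) depends on the actual block structure BloArk exposes:

```python
from typing import Callable, Iterable, Iterator, TypeVar

Block = TypeVar("Block")


def filter_blocks(
    blocks: Iterable[Block],
    predicate: Callable[[Block], bool],
) -> Iterator[Block]:
    """Yield only the blocks for which `predicate` returns True."""
    for block in blocks:
        if predicate(block):
            yield block


# Hypothetical usage with a BloArk dataset; the `article_id`
# field is an assumption, not a documented BloArk attribute:
# dataset = BloArk(data_path="./wikidata-edit-blocks")
# for block in filter_blocks(dataset, lambda b: b.article_id == 12345):
#     ...
```

Because the helper is lazy, it composes cleanly: you can stack several `filter_blocks` calls without materializing the full dataset in memory.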
Most projects do not use the full Wikipedia Edit Blocks dataset directly. Instead, you usually build a smaller derived dataset that matches your task: a subset of articles, a time slice, or a filtered set of revisions. At a high level, you iterate over the source blocks, keep the ones that match your criteria, and write the result out as a new dataset. Keeping the derived dataset in BloArk format makes it easy to reuse your work and share it with collaborators.
© 2025 Lingxi Li.
San Francisco