Wiki Edit Blocks
Introduction

Why do you need Wiki Edit Blocks dataset and BloArk data structure?


Last updated 1 year ago


Check the "last modified" date in the bottom right corner to spot potential updates to this documentation. We are constantly improving it.

The goal of this instruction

The goal of this documentation is to help you process Wiki Edit Blocks (aka Wikidata in our context) into structures of your liking. The underlying packages (e.g., BloArk) simply scale your efficiency and simplify your coding process.

There is huge potential in using the Wiki Edit Blocks dataset to empower future data mining and NLP research. If you have any questions about this documentation, the Wiki Edit Blocks dataset, or BloArk, feel free to drop an email to research@lingxi.li.

The data structure

All of the instructions here are based on a highly efficient data architecture called BloArk. It is a new way to store revision-based data that dramatically improves processing and indexing speed.

The best part of this data architecture is its reusability through our standard protocol. In research, we usually build new datasets on top of existing ones, but structural differences between datasets make it hard to adopt an efficient processing pipeline across them. With BloArk, any dataset it produces can be reused and modified by BloArk again. This significantly simplifies the preprocessing pipeline and increases productivity when processing large amounts of data, especially revision-based data.
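To make the idea of revision-based storage concrete, here is a minimal sketch of what a block of revision records might look like. This is an illustration only, not BloArk's actual API or file format: the field names (`article_id`, `rev_id`, `timestamp`, `text`) and the block file naming are hypothetical. The key property it demonstrates is that each line is a self-contained JSON record, so blocks can be read, filtered, and rewritten independently, which is what makes a dataset built this way easy to feed back into the same pipeline.

```python
import gzip
import json
import os
import tempfile

# Hypothetical revision records: one JSON object per line, each a single revision.
revisions = [
    {"article_id": 100, "rev_id": 1, "timestamp": "2020-01-01T00:00:00Z", "text": "First draft."},
    {"article_id": 100, "rev_id": 2, "timestamp": "2020-02-01T00:00:00Z", "text": "Revised draft."},
]

# Write a compressed block (hypothetical naming scheme).
path = os.path.join(tempfile.mkdtemp(), "block_000001.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rev in revisions:
        f.write(json.dumps(rev) + "\n")

# Read it back: every line stands alone, so many blocks can be
# processed in parallel and re-emitted in the same format.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Because the output of such a step is itself a set of blocks in the same line-oriented format, a modified dataset can be processed again by the same tooling, which is the reusability property described above.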
