Distribute your dataset

Once you have built a derived dataset on top of Wikipedia Edit Blocks, you may want to share it with collaborators or the broader community. Using BloArk as the on-disk format makes this straightforward: anyone can plug your dataset into the same tooling.

Recommended practices

  • Keep a short README next to the dataset that explains what was filtered or transformed compared to the original Wikipedia Edit Blocks release.
  • Version your derived datasets (for example, with a date or semantic version) so experiments can reference a specific snapshot.
  • Include small sample slices for quick inspection, so users can validate that they are loading the dataset correctly without scanning the full contents.
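The practices above can be sketched with a small helper script. This is only an illustration using the Python standard library, not part of BloArk itself: the function names `build_manifest` and `write_sample`, the `MANIFEST.json` file name, and the manifest fields are all hypothetical, and the sample helper assumes the data files are line-oriented text (for example JSON Lines), which may not match your layout.

```python
import hashlib
from pathlib import Path


def build_manifest(dataset_dir, version, notes):
    """Describe a dataset directory: version, free-text notes, and per-file checksums.

    Hypothetical helper; the manifest layout is an assumption, not a BloArk convention.
    """
    root = Path(dataset_dir)
    files = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path.name == "MANIFEST.json":
            continue  # do not checksum a previously written manifest
        sha = hashlib.sha256()
        with open(path, "rb") as handle:
            # Hash in 1 MiB chunks so large files are never loaded into memory at once.
            for chunk in iter(lambda: handle.read(1 << 20), b""):
                sha.update(chunk)
        files.append(
            {
                "path": path.relative_to(root).as_posix(),
                "bytes": path.stat().st_size,
                "sha256": sha.hexdigest(),
            }
        )
    return {"version": version, "notes": notes, "files": files}


def write_sample(source_file, sample_file, max_lines=100):
    """Copy the first max_lines lines of a line-oriented file into a small sample slice."""
    written = 0
    with open(source_file, "rb") as src, open(sample_file, "wb") as dst:
        for line in src:
            if written >= max_lines:
                break
            dst.write(line)
            written += 1
    return written
```

A typical use would be to call `write_sample` on one data file, then `build_manifest` on the dataset directory, and save the result with `json.dump` next to the README. The version string and notes then travel with the data, and recipients can verify the checksums after downloading.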

Publishing a derived dataset as another BloArk dataset preserves the benefits of the original format while letting others skip the most expensive preprocessing steps.

© 2025 Lingxi Li.
