Once you have built a derived dataset on top of Wikipedia Edit Blocks, you may want to share it with collaborators or the broader community. Using BloArk as the on-disk format makes this straightforward — anyone can plug your dataset into the same tooling.
Include a README next to the dataset that explains what was filtered or transformed compared to the original Wikipedia Edit Blocks release.

Publishing a derived dataset as another BloArk dataset preserves the benefits of the original format while letting others skip the most expensive preprocessing steps.
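As one possible way to produce the README mentioned above, the sketch below writes a short `README.md` that records the source release, the changes applied, and a SHA-256 checksum for each file in the dataset directory. It uses only the Python standard library and does not call BloArk itself. The function name `write_dataset_readme`, its parameters, and the README layout are illustrative assumptions, not part of any official tooling.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large block files are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_dataset_readme(dataset_dir, source_release, changes):
    """Write README.md in dataset_dir (hypothetical helper, layout is an assumption)."""
    root = Path(dataset_dir)
    readme = root / "README.md"

    lines = [
        "# Derived dataset",
        "",
        f"Source: {source_release}",
        "",
        "## Changes relative to the source",
        "",
    ]
    # One bullet per filtering or transformation step.
    lines += [f"- {change}" for change in changes]

    lines += ["", "## Files", ""]
    # List every file except the README itself, with a checksum for verification.
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if path == readme:
            continue
        relative = path.relative_to(root).as_posix()
        lines.append(f"- `{relative}` sha256 `{sha256_of(path)}`")

    readme.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return readme
```

A call might look like `write_dataset_readme("my_dataset", "Wikipedia Edit Blocks", ["Removed bot edits"])`, where the directory name and the change description are placeholders. Listing checksums lets collaborators confirm that they downloaded the same files you published.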
© 2025 Lingxi Li.