
Introduction

Why do you need the Wiki Edit Blocks dataset and the BloArk data structure?

Check the "last modified" date in the bottom-right corner to see when this documentation was updated. We are constantly improving it.

The goal of this documentation

The goal of this documentation is to help you process Wiki Edit Blocks (also called Wikidata in our context) into structures of your liking. The underlying packages (e.g. BloArk) simply scale your efficiency and simplify your coding process.

There is huge potential in how the Wiki Edit Blocks dataset can empower future data mining and NLP research. If you ever have a question about this documentation, the Wiki Edit Blocks dataset, or BloArk, feel free to drop an email to research@lingxi.li.

The data structure

All of the instructions here are based on a highly efficient data architecture called BloArk. It is a new way to store revision-based data that dramatically improves processing and indexing speed.

The best part of this data architecture is its reusability through our standard protocol. In research, we usually build new datasets on top of existing ones, but structural differences between datasets make it hard to reuse an efficient processing pipeline across them. With BloArk, any dataset it produces can be read back in and modified by BloArk again. This significantly simplifies the preprocessing pipeline and increases productivity when processing huge amounts of data, especially revision-based data.
