Wiki Edit Blocks
Introduction

Why do you need Wiki Edit Blocks dataset and BloArk data structure?


Last updated 1 year ago


Check the "last modified" date in the bottom right corner to spot potential updates to this documentation. We are constantly improving it.

The goal of this instruction

The goal of this documentation is to help you process Wiki Edit Blocks (aka Wikidata in our context) into structures of your liking. The underlying packages (e.g., BloArk) simply scale your efficiency and simplify your coding process.

There is huge potential in using the Wiki Edit Blocks dataset to empower future data mining and NLP research. If you have any questions about this documentation, the Wiki Edit Blocks dataset, or BloArk, feel free to drop an email to research@lingxi.li.

The data structure

All of the instructions here are based on a highly efficient data architecture called BloArk. It is a new way to store revision-based data that dramatically improves processing and indexing speed.

The best part of this data architecture is its reusability through our standard protocol. In research, we usually build new datasets on top of existing ones, but structural differences between datasets make it hard to adopt an efficient processing pipeline across them. With BloArk, any dataset it produces can be reused and modified by BloArk again. This significantly simplifies the preprocessing pipeline and increases productivity when processing large amounts of data, especially revision-based data.
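To make the idea of revision-based storage concrete, here is a minimal sketch of what a block of revision records might look like. This is an illustration only, not BloArk's actual API or file format: the field names (`article_id`, `rev_id`, `timestamp`, `text`) and the block file naming are hypothetical. The key property it demonstrates is that each line is a self-contained JSON record, so blocks can be read, filtered, and rewritten independently, which is what makes a dataset built this way easy to feed back into the same pipeline.

```python
import gzip
import json
import os
import tempfile

# Hypothetical revision records: one JSON object per line, each a single revision.
revisions = [
    {"article_id": 100, "rev_id": 1, "timestamp": "2020-01-01T00:00:00Z", "text": "First draft."},
    {"article_id": 100, "rev_id": 2, "timestamp": "2020-02-01T00:00:00Z", "text": "Revised draft."},
]

# Write a compressed block (hypothetical naming scheme).
path = os.path.join(tempfile.mkdtemp(), "block_000001.jsonl.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    for rev in revisions:
        f.write(json.dumps(rev) + "\n")

# Read it back: every line stands alone, so many blocks can be
# processed in parallel and re-emitted in the same format.
with gzip.open(path, "rt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]
```

Because the output of such a step is itself a set of blocks in the same line-oriented format, a modified dataset can be processed again by the same tooling, which is the reusability property described above.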
