How Much Data Is All Of Wikipedia

How to programmatically download and parse the English Wikipedia


Wikipedia is one of modern humanity's most impressive creations. Who would have thought that in just a few years, anonymous contributors working for free could create the greatest source of online knowledge the world has ever seen? Not only is Wikipedia the best place to get information for writing your college papers, but it's also an extremely rich source of data that can fuel numerous data science projects, from natural language processing to supervised machine learning.

The size of Wikipedia makes it both the world's largest encyclopedia and slightly intimidating to work with. However, size is not an issue with the right tools, and in this article, we'll walk through how we can programmatically download and parse through all of the English language Wikipedia.

Along the way, we'll cover a number of useful topics in data science:

  1. Finding and programmatically downloading data from the web
  2. Parsing web data (HTML, XML, MediaWiki) using Python libraries
  3. Running operations in parallel with multiprocessing/multithreading
  4. Benchmarking methods to find the optimal solution to a problem

The original impetus for this project was to collect data on every single book on Wikipedia, but I soon realized the solutions involved were more broadly applicable. The techniques covered here and presented in the accompanying Jupyter Notebook will let you efficiently work with any articles on Wikipedia and can be extended to other sources of web data.

If you'd like to see more about using the data in this article, I wrote a post using neural network embeddings to build a book recommendation system.

The notebook containing the Python code for this article is available on GitHub. This project was inspired by the excellent Deep Learning Cookbook by Douwe Osinga and much of the code is adapted from the book. The book is well worth it and you can access the Jupyter Notebooks at no cost on GitHub.

Finding and Downloading Data Programmatically

The first step in any data science project is accessing your data! While we could make individual requests to Wikipedia pages and scrape the results, we'd quickly run into rate limits and unnecessarily tax Wikipedia's servers. Instead, we can access a dump of all of Wikipedia through Wikimedia at dumps.wikimedia.org. (A dump refers to a periodic snapshot of a database).

The English version is at dumps.wikimedia.org/enwiki. We view the available versions of the database using the following code:

import requests

# Library for parsing HTML
from bs4 import BeautifulSoup

base_url = 'https://dumps.wikimedia.org/enwiki/'
index = requests.get(base_url).text
soup_index = BeautifulSoup(index, 'html.parser')

# Find the links on the page
dumps = [a['href'] for a in soup_index.find_all('a') if
         a.has_attr('href')]
dumps

['../',
'20180620/',
'20180701/',
'20180720/',
'20180801/',
'20180820/',
'20180901/',
'20180920/',
'latest/']

This code makes use of the BeautifulSoup library for parsing HTML. Given that HTML is the standard markup language for web pages, this is an invaluable library for working with web data.

For this project, we'll take the dump on September 1, 2018 (some of the dumps are incomplete so make sure to choose one with the data you need). To find all the available files in the dump, we use the following code:

dump_url = base_url + '20180901/'

# Retrieve the html
dump_html = requests.get(dump_url).text

# Convert to a soup
soup_dump = BeautifulSoup(dump_html, 'html.parser')

# Find list elements with the class file
soup_dump.find_all('li', {'class': 'file'})[:3]

[<li class="file"><a href="/enwiki/20180901/enwiki-20180901-pages-articles-multistream.xml.bz2">enwiki-20180901-pages-articles-multistream.xml.bz2</a> 15.2 GB</li>,
 <li class="file"><a href="/enwiki/20180901/enwiki-20180901-pages-articles-multistream-index.txt.bz2">enwiki-20180901-pages-articles-multistream-index.txt.bz2</a> 195.6 MB</li>,
 <li class="file"><a href="/enwiki/20180901/enwiki-20180901-pages-meta-history1.xml-p10p2101.7z">enwiki-20180901-pages-meta-history1.xml-p10p2101.7z</a> 320.6 MB</li>]

Again, we parse the webpage using BeautifulSoup to find the files. We could go to https://dumps.wikimedia.org/enwiki/20180901/ and look for the files to download manually, but that would be inefficient. Knowing how to parse HTML and interact with websites in a program is an extremely useful skill considering how much data is on the web. Learn a little web scraping and vast new data sources become accessible. (Here's a tutorial to get you started).
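For example, a minimal sketch of pulling the file names and sizes out of those list elements (the element structure is assumed from the output shown above) could look like this:

# Pull out (file name, size) pairs from the list elements found above
files = []
for file in soup_dump.find_all('li', {'class': 'file'}):
    text = file.text
    # Each <li> reads like "enwiki-...-multistream.xml.bz2 15.2 GB"
    files.append((text.split()[0], ' '.join(text.split()[1:])))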

Deciding what to Download

The above code finds all of the files in the dump. This includes several options for download: the current version of only the articles, the articles along with the current discussion, or the articles along with all past edits and discussion. If we go with the latter option, we are looking at several terabytes of data! For this project, we'll stick to the most recent version of only the articles. This page is useful for determining which files to get given your needs.

The current version of all the articles is available as a single file. However, if we get the single file, then when we parse it, we'll be stuck going through all the articles sequentially, one at a time, a very inefficient approach. A better option is to download partitioned files, each of which contains a subset of the articles. Then, as we'll see, we can parse through multiple files at a time through parallelization, speeding up the process significantly.

When I'm dealing with files, I would rather have many small files than one large file because then I can parallelize operations on the files.

The partitioned files are available as bz2-compressed XML (eXtensible Markup Language). Each partition is around 300–400 MB in size with a total compressed size of 15.4 GB. We won't need to decompress the files, but if you choose to do so, the entire size is around 58 GB. This really doesn't seem too large for all of human knowledge! (Okay, not all knowledge, but still.)

Compressed Size of Wikipedia (Source).

Downloading Files

To actually download the files, the Keras utility get_file is extremely useful. This downloads a file at a link and saves it to disk.

from keras.utils import get_file

saved_file_path = get_file(file, url)

The files are saved in ~/.keras/datasets/, the default save location for Keras. Downloading all of the files one at a time takes a little over 2 hours. (You can try to download in parallel, but I ran into rate limits when I tried to make multiple requests at the same time.)
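As a rough sketch of the download loop (the file list and the dump URL below are illustrative, not the notebook's exact code; in practice the names come from the list elements parsed earlier), this could look like:

from keras.utils import get_file

dump_url = 'https://dumps.wikimedia.org/enwiki/20180901/'

# Illustrative list of partition file names; the full list has all 55 partitions
files_to_download = ['enwiki-20180901-pages-articles15.xml-p7744803p9244803.bz2']

data_paths = []
for f in files_to_download:
    # get_file saves to ~/.keras/datasets/ by default and reuses
    # files that are already on disk
    data_paths.append(get_file(f, dump_url + f))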

Parsing the Data

It might seem like the first thing we want to do is decompress the files. However, it turns out we won't ever actually need to do this to access all the data in the articles! Instead, we can iteratively work with the files by decompressing and processing lines one at a time. Iterating through files is often the only option if we work with large datasets that do not fit in memory.

To iterate through a bz2 compressed file we could use the bz2 library. In testing though, I found that a faster option (by a factor of 2) is to call the system utility bzcat with the subprocess Python module. This illustrates a critical point: often, there are multiple solutions to a problem and the only way to find what is most efficient is to benchmark the options. This can be as simple as using the %%timeit Jupyter cell magic to time the methods.
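As a sketch of what that benchmark can look like in a notebook (with data_path pointing to one of the downloaded partitions, as defined just below; timings will vary by machine), the two approaches go in separate %%timeit cells:

%%timeit -n 1 -r 1
# Read the partition with the built-in bz2 library
import bz2
lines = [line for line in bz2.open(data_path, 'rt')]

%%timeit -n 1 -r 1
# Read the same partition by piping it through the bzcat system utility
import subprocess
lines = [line for line in subprocess.Popen(['bzcat'],
                                           stdin=open(data_path),
                                           stdout=subprocess.PIPE).stdout]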

For the complete details, see the notebook, but the basic format of iteratively decompressing a file is:

import subprocess

data_path = '~/.keras/datasets/enwiki-20180901-pages-articles15.xml-p7744803p9244803.bz2'

# Iterate through the compressed file one line at a time
for line in subprocess.Popen(['bzcat'],
                             stdin=open(data_path),
                             stdout=subprocess.PIPE).stdout:
    # process line
    pass

If we just read in the XML data and append it to a list, we get something that looks like this:

Raw XML from Wikipedia Article.

This shows the XML from a single Wikipedia article. The files we have downloaded contain millions of lines like this, with thousands of articles in each file. If we really wanted to make things hard, we could go through this using regular expressions and string matching to find each article. Given this is extraordinarily inefficient, we'll take a better approach using tools custom built for parsing both XML and Wikipedia-style articles.

Parsing Approach

We need to parse the files on two levels:

  1. Extract the article titles and text from the XML
  2. Extract relevant information from the article text

Fortunately, there are good options for both of these operations in Python.

Parsing XML

To solve the first problem of locating articles, we'll use the SAX parser, which is "The Simple API for XML." BeautifulSoup can also be used for parsing XML, but this requires loading the entire file into memory and building a Document Object Model (DOM). SAX, on the other hand, processes XML one line at a time, which fits our approach perfectly.

The basic idea we need to execute is to search through the XML and extract the information between specific tags. (If you need an introduction to XML, I'd recommend starting here.) For example, given the XML below:

<title>Carroll F. Knicely</title>
<text xml:space="preserve">\'\'\'Carroll F. Knicely\'\'\' (born c. 1929 in [[Staunton, Virginia]] - died November 2, 2006 in [[Glasgow, Kentucky]]) was [[Editing|editor]] and [[Publishing|publisher]] of the \'\'[[Glasgow Daily Times]]\'\' for nearly 20 years (and later, its owner) and served under three [[Governor of Kentucky|Kentucky Governors]] as commissioner and later Commerce Secretary.\n'
</text>

We want to select the content between the <title> and <text> tags. (The title is simply the Wikipedia page title and the text is the content of the article.) SAX will let us do exactly this using a parser and a ContentHandler which controls how the information passed to the parser is handled. We pass the XML to the parser one line at a time and the ContentHandler lets us extract the relevant information.

This is a little difficult to follow without trying it out yourself, but the idea is that the Content Handler looks for certain start tags, and when it finds one, it adds characters to a buffer until it encounters the same end tag. Then it saves the buffer content to a dictionary with the tag as the key. The result is that we get a dictionary where the keys are the tags and the values are the content between the tags. We can then send this dictionary to another function that will parse the values in the dictionary.

The only part of SAX we need to write is the Content Handler. This is shown in its entirety below:

Content Handler for SAX parser
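The embedded gist is not reproduced in this copy, so here is a minimal sketch of such a handler, reconstructed from the description below. It also tracks the timestamp tag, since the time of the last edit shows up in the final output, and the notebook version additionally initializes the _books and _article_count attributes used later.

import xml.sax

class WikiXmlHandler(xml.sax.handler.ContentHandler):
    """Content handler for Wiki XML data using SAX."""
    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self._buffer = None
        self._values = {}
        self._current_tag = None
        self._pages = []

    def characters(self, content):
        """Characters between opening and closing tags."""
        if self._current_tag:
            self._buffer.append(content)

    def startElement(self, name, attrs):
        """Opening tag of an element."""
        # timestamp is tracked as well because the time of the
        # last edit is kept in the final output
        if name in ('title', 'text', 'timestamp'):
            self._current_tag = name
            self._buffer = []

    def endElement(self, name):
        """Closing tag of an element."""
        if name == self._current_tag:
            self._values[name] = ' '.join(self._buffer)

        if name == 'page':
            self._pages.append((self._values['title'], self._values['text']))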

In this code, we are looking for the tags title and text. Every time the parser encounters one of these, it will save characters to the buffer until it encounters the same end tag (identified by </tag>). At this point it will save the buffer contents to a dictionary, self._values. Articles are separated by <page> tags, so if the content handler encounters an ending </page> tag, then it should add the self._values to the list of articles, self._pages. If this is a little confusing, then perhaps seeing it in action will help.

The code below shows how we use this to search through the XML file to find articles. For now we're only saving them to the handler._pages attribute, but later we'll send the articles to another function for parsing.

# Object for handling xml
handler = WikiXmlHandler()

# Parsing object
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Iteratively process the file
for line in subprocess.Popen(['bzcat'],
                             stdin=open(data_path),
                             stdout=subprocess.PIPE).stdout:
    parser.feed(line)

    # Stop when 3 articles have been found
    if len(handler._pages) > 2:
        break

If we inspect handler._pages, we'll see a list, each element of which is a tuple with the title and text of one article:

handler._pages[0]

[('Carroll Knicely',
  "'''Carroll F. Knicely''' (born c. 1929 in [[Staunton, Virginia]] - died November 2, 2006 in [[Glasgow, Kentucky]]) was [[Editing|editor]] and [[Publishing|publisher]] ...)]

At this point we have written code that can successfully identify articles within the XML. This gets us halfway through the process of parsing the files, and the next step is to process the articles themselves to find specific pages and information. Once again, we'll turn to a tool purpose built for the task.

Parsing Wikipedia Articles

Wikipedia runs on software for building wikis known as MediaWiki. This means that articles follow a standard format that makes programmatically accessing the information within them simple. While the text of an article may look like just a string, it encodes far more information due to the formatting. To efficiently get at this information, we bring in the powerful mwparserfromhell, a library built to work with MediaWiki content.

If we pass the text of a Wikipedia article to mwparserfromhell, we get a Wikicode object which comes with many methods for sorting through the information. For example, the following code creates a wikicode object from an article (about KENZ FM) and retrieves the wikilinks() within the article. These are all of the links that point to other Wikipedia articles:

import mwparserfromhell

# Create the wiki article
wiki = mwparserfromhell.parse(handler._pages[6][1])

# Find the wikilinks
wikilinks = [x.title for x in wiki.filter_wikilinks()]
wikilinks[:5]

['Provo, Utah', 'Wasatch Front', 'Megahertz', 'Contemporary hit radio', 'watt']

There are a number of useful methods that can be applied to the wikicode, such as finding comments or searching for a specific keyword. If you want to get a clean version of the article text, then call:

wiki.strip_code().strip()

'KENZ (94.9 FM, "Power 94.9") is a top 40/CHR radio station broadcasting to Salt Lake City, Utah'
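Other filters work the same way. For instance, a quick sketch of pulling the external links and editor comments out of the same article (method names taken from the mwparserfromhell documentation):

# External links that point outside Wikipedia
external_links = [x.url for x in wiki.filter_external_links()]

# Comments left in the article source by editors
comments = wiki.filter_comments()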

Since my ultimate goal was to find all the articles about books, the question arises: is there a way to use this parser to identify articles in a certain category? Fortunately, the answer is yes, using MediaWiki templates.

Article Templates

Templates are standard ways of recording information. There are numerous templates for everything on Wikipedia, but the most relevant for our purposes are Infoboxes. These are templates that encode summary information for an article. For instance, the infobox for War and Peace is:

Each category of articles on Wikipedia, such as films, books, or radio stations, has its own type of infobox. In the case of books, the infobox template is helpfully named Infobox book. Just as helpful, the wiki object has a method called filter_templates() that allows us to extract a specific template from an article. Therefore, if we want to know whether an article is about a book, we can filter it for the book infobox. This is shown below:

# Filter article for the book template
wiki.filter_templates(matches='Infobox book')

If there's a match, then we've found a book! To find the Infobox template for the category of articles you are interested in, refer to the list of infoboxes.

How do we combine mwparserfromhell for parsing articles with the SAX parser we wrote? Well, we modify the endElement method in the Content Handler to send the dictionary of values containing the title and text of an article to a function that searches the article text for a specified template. If the function finds an article we want, it extracts information from the article and then returns it to the handler. First, I'll show the updated endElement:

def endElement(self, name):
    """Closing tag of element"""
    if name == self._current_tag:
        self._values[name] = ' '.join(self._buffer)

    if name == 'page':
        self._article_count += 1
        # Send the page to the process article function
        book = process_article(**self._values,
                               template='Infobox book')
        # If the article is a book, append it to the list of books
        if book:
            self._books.append(book)

Now, once the parser has hit the end of an article, we send the article on to the function process_article, which is shown below:

Process Article Function
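The embedded gist is not reproduced in this copy, but a sketch of process_article consistent with the book output shown below (the infobox properties, the internal wikilinks, the external links, and the timestamp of the last edit) could look like the following:

import mwparserfromhell

def process_article(title, text, timestamp, template='Infobox book'):
    """Process a wikipedia article looking for a given infobox template."""
    # Create a parsing object from the raw article text
    wikicode = mwparserfromhell.parse(text)

    # Search through the templates for the template of interest
    matches = wikicode.filter_templates(matches=template)

    if len(matches) >= 1:
        # Extract information from the infobox as a dictionary
        properties = {param.name.strip_code().strip():
                      param.value.strip_code().strip()
                      for param in matches[0].params
                      if param.value.strip_code().strip()}

        # Internal wikilinks
        wikilinks = [x.title.strip_code().strip()
                     for x in wikicode.filter_wikilinks()]

        # External links
        exlinks = [x.url.strip_code().strip()
                   for x in wikicode.filter_external_links()]

        return (title, properties, wikilinks, exlinks, timestamp)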

Although I'm looking for books, this function can be used to search for any category of article on Wikipedia. Just replace the template with the template for the category (such as Infobox language to find languages) and it will only return the information from articles within the category.

We can test this function and the new ContentHandler on one file.
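A sketch of that test (assuming the handler initializes the _books list and _article_count counter used in the updated endElement, and timing with the standard library) might be:

import xml.sax
import subprocess
from timeit import default_timer as timer

start = timer()

# Fresh handler and parser for a single file
handler = WikiXmlHandler()
parser = xml.sax.make_parser()
parser.setContentHandler(handler)

# Feed the compressed file through bzcat one line at a time
for line in subprocess.Popen(['bzcat'],
                             stdin=open(data_path),
                             stdout=subprocess.PIPE).stdout:
    parser.feed(line)

print(f'Searched through {handler._article_count} articles.')
print(f'Found {len(handler._books)} books in {round(timer() - start)} seconds.')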

Searched through 427481 articles.
Found 1426 books in 1055 seconds.

Let's take a look at the output for one book:

books[10]

['War and Peace',
 {'name': 'War and Peace',
 'author': 'Leo Tolstoy',
 'language': 'Russian, with some French',
 'country': 'Russia',
 'genre': 'Novel (Historical novel)',
 'publisher': 'The Russian Messenger (serial)',
 'title_orig': 'Война и миръ',
 'orig_lang_code': 'ru',
 'translator': 'The first translation of War and Peace into English was by American Nathan Haskell Dole, in 1899',
 'image': 'Tolstoy - War and Peace - first edition, 1869.jpg',
 'caption': 'Front page of War and Peace, first edition, 1869 (Russian)',
 'release_date': 'Serialised 1865–1867; book 1869',
 'media_type': 'Print',
 'pages': '1,225 (first published edition)'},
['Leo Tolstoy',
'Novel',
'Historical novel',
'The Russian Messenger',
'Series (publishing)',
'Category:1869 Russian novels',
'Category:Epic novels',
'Category:Novels set in 19th-century Russia',
'Category:Russian novels adapted into films',
'Category:Russian philosophical novels'],
['https://books.google.com/?id=c4HEAN-ti1MC',
'https://www.britannica.com/art/English-literature',
'https://books.google.com/books?id=xf7umXHGDPcC',
'https://books.google.com/?id=E5fotqsglPEC',
'https://books.google.com/?id=9sHebfZIXFAC'],
'2018-08-29T02:37:35Z']

For every single book on Wikipedia, we have the information from the Infobox as a dictionary, the internal wikilinks, the external links, and the timestamp of the most recent edit. (I'm concentrating on these pieces of information to build a book recommendation system for my next project.) You can modify the process_article function and WikiXmlHandler class to find any information and articles you need!

If you look at the time to process just one file, 1055 seconds, and multiply that by 55, you get over 15 hours of processing time for all the files! Granted, we could just run that overnight, but I'd rather not waste the extra time if I don't have to. This brings us to the final technique we'll cover in this project: parallelization using multiprocessing and multithreading.

Running Operations in Parallel

Instead of parsing through the files one at a time, we want to process several of them at once (which is why we downloaded the partitions). We can do this using parallelization, either through multithreading or multiprocessing.

Multithreading and Multiprocessing

Multithreading and multiprocessing are ways to carry out many tasks on a computer, or multiple computers, simultaneously. We have many files on disk, each of which needs to be parsed in the same way. A naive approach would be to parse one file at a time, but that is not taking full advantage of our resources. Instead, we use either multithreading or multiprocessing to parse many files at the same time, significantly speeding up the entire process.

Generally, multithreading works better (is faster) for input/output bound tasks, such as reading in files or making requests. Multiprocessing works better (is faster) for CPU-bound tasks (source). For the process of parsing articles, I wasn't sure which method would be optimal, so again I benchmarked both of them with different parameters.

Learning how to set up tests and seek out different ways to solve a problem will get you far in data science or any technical career.

(The code for testing multithreading and multiprocessing appears at the end of the notebook.) When I ran the tests, I found multiprocessing was almost 10 times faster, indicating this process is probably CPU bound (limited).
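That benchmarking code is not reproduced here, but a simplified sketch of the comparison, using the find_books function and the list of partitions defined in the next section, could look like this:

from multiprocessing import Pool
from multiprocessing.dummy import Pool as ThreadPool
from timeit import default_timer as timer

# Compare processes and threads on a small sample of partitions
sample = partitions[:4]

for pool_class, label in [(Pool, 'processes'), (ThreadPool, 'threads')]:
    start = timer()
    pool = pool_class(processes=4)
    pool.map(find_books, sample)
    pool.close()
    pool.join()
    print(f'4 {label}: {round(timer() - start)} seconds.')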

Processing results (left) vs threading results (right).

Learning multithreading / multiprocessing is essential for making your data science workflows more efficient. I'd recommend this article to get started with the concepts. (We'll stick to the built-in multiprocessing library, but you can also use Dask for parallelization as in this project.)

After running a number of tests, I found the fastest way to process the files was using 16 processes, one for each core of my computer. This means we can process 16 files at a time instead of 1! I'd encourage anyone to test out a few options for multiprocessing / multithreading and let me know the results! I'm still not sure I did things in the best way, and I'm always willing to learn.

Setting Up Parallelized Code

To run an operation in parallel, we need a service and a set of tasks. A service is simply a function, and the tasks are in an iterable, such as a list, each of which we send to the function. For the purpose of parsing the XML files, each task is one file, and the function will take in the file, find all the books, and save them to disk. The pseudo-code for the function is below:

def find_books(data_path, save=True):
    """Find and save all the book articles from a compressed
    wikipedia XML file."""
    # Parse file for books

    if save:
        # Save all books to a file based on the data path name

The end result of running this function is a saved list of books from the file sent to the function. The files are saved as json, a machine readable format for writing nested information such as lists of lists and dictionaries. The tasks that we want to send to this function are all the compressed files.
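Fleshing out the pseudo-code, a simplified version of find_books might look like the sketch below (the notebook version adds timing and more bookkeeping; the output file naming here is only one reasonable choice):

import os
import json
import subprocess
import xml.sax

def find_books(data_path, save=True):
    """Find and save all the book articles from a compressed wikipedia XML file."""
    # Parse the file for books using the SAX handler defined earlier
    handler = WikiXmlHandler()
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)

    for line in subprocess.Popen(['bzcat'],
                                 stdin=open(data_path),
                                 stdout=subprocess.PIPE).stdout:
        parser.feed(line)

    if save:
        # Save all books to a json file named after the data path
        out_path = os.path.expanduser(
            '~/.keras/datasets/' + os.path.basename(data_path) + '.json')
        with open(out_path, 'w') as fout:
            json.dump(handler._books, fout)

    return len(handler._books)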

# List of compressed files to process
partitions = [keras_home + file for file in os.listdir(keras_home) if 'xml-p' in file]
len(partitions), partitions[-1]

(55, '/home/ubuntu/.keras/datasets/enwiki-20180901-pages-articles17.xml-p11539268p13039268.bz2')

For each file, we want to send it to find_books to be parsed.

Searching through all of Wikipedia

The final code to search through every article on Wikipedia is below:

from multiprocessing import Pool

# Create a pool of workers to execute processes
pool = Pool(processes=16)

# Map (service, tasks), applies the function to each partition
results = pool.map(find_books, partitions)
pool.close()
pool.join()

We map each task to the service, the function that finds the books (map refers to applying a function to each item in an iterable). Running with 16 processes in parallel, we can search all of Wikipedia in under 3 hours! After running the code, the books from each file are saved on disk in separate json files.

Reading and Joining Files with Multithreading

For practice writing parallelized code, we'll read the separate files back in, this time using threads. The multiprocessing.dummy library provides a wrapper around the threading module. This time the service is read_data and the tasks are the saved files on disk:
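The embedded code isn't shown in this copy, but a minimal sketch, assuming the json files written by the find_books sketch above, could be:

import os
import json
from multiprocessing.dummy import Pool as ThreadPool

def read_data(file_path):
    """Read in one json file of saved books."""
    with open(file_path) as fin:
        return json.load(fin)

# The tasks are the saved json files on disk
keras_home = os.path.expanduser('~/.keras/datasets/')
saved_files = [keras_home + f for f in os.listdir(keras_home)
               if f.endswith('.json')]

pool = ThreadPool(processes=10)
results = pool.map(read_data, saved_files)
pool.close()
pool.join()

# Flatten the list of lists into a single list of books
book_list = [book for file_books in results for book in file_books]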

The multithreaded code works in the exact same way, mapping tasks in an iterable to a function. Once we have the list of lists, we flatten it to a single list.

print(f'Found {len(book_list)} books.')

Found 37861 books.

Wikipedia has nearly 38,000 articles on books according to our count. The size of the final json file with all the book information is only about 55 MB, meaning we searched through over 50 GB (uncompressed) of total files to find 55 MB worth of books! Given that we are only keeping a limited subset of the book information, that makes sense.

We now have information on every single book on Wikipedia. You can use the same code to find articles for any category of your choosing, or modify the functions to search for different information. Using some fairly simple Python code, we are able to search through an incredible amount of information.

Size of Wikipedia if printed in volumes (Source).

Conclusions

In this article, we saw how to download and parse the entire English language version of Wikipedia. Having a ton of data is not useful unless we can make sense of it, and so we developed a set of methods for efficiently processing all of the articles for the information we need for our projects.

Throughout this project, we covered a number of important topics:

  1. Finding and downloading data programmatically
  2. Parsing through data in an efficient manner
  3. Running operations in parallel to get the most from our hardware
  4. Setting up and running benchmarking tests to find efficient solutions

The skills developed in this project are well-suited to Wikipedia data but are also broadly applicable to any data from the web. I'd encourage you to apply these methods for your own projects or try analyzing a different category of articles. There's plenty of data for everyone to do their own project! (I am working on making a book recommendation system with the Wikipedia articles using entity embeddings from neural networks.)

Wikipedia is an incredible source of human-curated information, and we now know how to use this awe-inspiring achievement by accessing and processing it programmatically. I look forward to writing about and doing more Wikipedia Data Science. In the meantime, the techniques presented here are broadly applicable, so get out there and find a problem to solve!

As always, I welcome feedback and constructive criticism. I can be reached on Twitter @koehrsen_will or on my personal website at willk.online.

Source: https://towardsdatascience.com/wikipedia-data-science-working-with-the-worlds-largest-encyclopedia-c08efbac5f5c
