Querying Wikipedia Data

Posted by U.B.Pavanaja at Oct 21, 2016 02:50 PM |
Recently I wrote a blog about the stub article length of Wikipedia articles. I mentioned the difference in actual number of characters and the number of bytes used to define stub articles between English and Indian language Wikipedias. One can open any language Wikipedia, type Special:ShortPages in the search box to get the list of articles which have less than 2048 bytes.

Originally published blog can be accessed here.


However, as already mentioned in that blog, the number of bytes for Indian languages to be considered as stub should be actually 2048*3 = 6144 bytes employing the same criteria. How to find the list of articles fulfilling this condition?

This brings us to the topic of querying Wikipedia data. Wikimedia Foundation Labs has put up a website wherein one can run SQL queries on Wikimedia data. The URL of the website is quarry.wmflabs.org. When we open the website, we get a textbox wherein one can type the SQL query which will run on Wikimedia data. In this example I will consider Wikipedia only. But the queries can be run on the data of other Wikimedia projects like Wikisource, Wikidata, Wiktionary, etc.

One has to login with his/her Wikimedia login. After logging the SQL query can be typed in the textbox and the Submit Query button has to be clicked. The result of execution of the query on Wikimedia data will be displayed. In this blog I will throw more light on Kannada Wikipedia. The database for Kannada Wikipedia is called knwiki_p. Complete list of databases can be obtained by running the SQL query “show databases”.

To get the list of tables in Kannada Wikipedia, the following SQL queries have to be executed:

use knwiki_p;
show tables;

To know the scheme of any table, run the query desc <tablename>;. For example, to know the details of the table by name, issue the query desc page;. The fields which are of importance in the current case is page_title and page_len. The following query will list all articles in Kannada Wikipedia which are having less than 6144 bytes.

use knwiki_p;
select page_title, page_len
from page where page_len < ‘6144’ and page_namespace = 0 and page_is_redirect = 0 order by page_len ;

Some other useful queries are listed below:

QueryWhat it does
Select Count(*) from page where page_namespace = 0 and page_is_redirect =0; Number of articles without redirect
Select Count(*) from page where page_namespace = 0 and page_is_redirect =0 and page_len < 6144; Number of articles which are having bytes less than 6144
select * from user where user_name Like “P%”; List all users whose username starts with letter “P”
select user_id, user_name, user_editcount  from user where user_editcount >3000 order by user_editcount desc; List all users with editcount more than 3000
SELECT

page_namespace,

page_title,

page_len

FROM page

WHERE page_len > 175000

AND page_title NOT LIKE “%/%”

ORDER BY page_namespace ASC;

List of long articles (articles having  bytes more than 175000)
SELECT rc_title as title, rc_comment as comments, count(*) as Edits

FROM recentchanges

WHERE rc_namespace = 0

GROUP BY 1 ORDER BY 3 DESC

LIMIT 100;

Most edited 100 pages during past one month
SELECT log_title, COUNT(*) FROM logging WHERE log_type=”thanks” GROUP BY log_title ORDER BY COUNT(*) DESC LIMIT 100; Who have been thanked most

Useful links -

  1. Details about Quarry
  2. MySQL queries help

Document Actions