National parks around the world
I am an avid hiker, and some of my favourite trails can be found in provincial and national parks. My hiking goal is to visit as many provincial and national parks as I can. That made me curious about which country has the most national parks, in case I ever go on an international hiking trip. With this in mind, I set out to plot a choropleth map illustrating which countries have the most parks, with a secondary goal of plotting each individual national park.
What is a national park?
Each country may have its own methodology to define a national park, but for this project, I settled on the definition used by the International Union for Conservation of Nature (IUCN). This was partly because it allowed me to use a single definition to identify national parks for each country, and partly because there was a Wikipedia page that neatly organized the number of national parks for each country using this definition. According to this page, there are 3,247 national parks in the world. As this webpage already had the number of national parks in each country, all that remained was getting each national park’s GPS coordinates.
Scraping for National Park Coordinates
Basic Methodology
From the Wikipedia page mentioned above, most countries had a link to another Wikipedia page that contained the national parks for the respective country. From the country’s Wikipedia page, you could usually find links to each national park found in that country. Within the national park’s Wikipedia page, you could usually find the GPS coordinates of the national park - our ultimate prize.
The workflow seemed simple enough - I could scrape each country’s webpage to first find the name and URL of that country’s national parks, and then scrape each national park’s webpage to get the GPS coordinates. However, after having looked at 15-20 different webpages for different countries, I noticed inconsistent webpage structures. When it comes to web scraping, you are trying to parse through the webpage’s HTML code to extract information from specific HTML tags, attributes and elements. If the webpages were structured similarly, then I could write my web scraping code to look for the same elements in each webpage. However, because of the inconsistent structures, I would have to account for these differences.
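The two-step workflow described above can be sketched as a small driver function. This is an illustrative outline under stated assumptions, not the code used for this project: `scrape_parks`, `fetch`, `find_park_links` and `find_coordinates` are all hypothetical names, and the three helpers are passed in as arguments so that the outline runs without network access.

```python
from urllib.parse import urljoin


def scrape_parks(country_urls, fetch, find_park_links, find_coordinates):
    """Step 1: country page -> park links. Step 2: park page -> coordinates.

    All names here are hypothetical. `fetch(url)` returns HTML text,
    `find_park_links(html)` returns (name, href) pairs, and
    `find_coordinates(html)` returns a (lat, lon) tuple or None.
    """
    records = []
    for country, country_url in country_urls.items():
        country_html = fetch(country_url)
        for name, href in find_park_links(country_html):
            # Links on a page are often relative, so resolve them
            # against the URL of the page they were found on.
            park_url = urljoin(country_url, href)
            coordinates = find_coordinates(fetch(park_url))
            records.append(
                {
                    "country": country,
                    "park": name,
                    "url": park_url,
                    "coordinates": coordinates,
                }
            )
    return records
```

Keeping the page-specific logic in the helper functions means the inconsistent webpage structures only affect `find_park_links`, not the overall loop.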
Tables and Unordered Lists
In most cases, a country’s national parks were organized either in a table or in an unordered list (an HTML element that usually appears as a bulleted list in a webpage). Occasionally, there would be multiple tables or multiple unordered lists on the webpage. Once I had the webpage’s underlying HTML code, I knew that to get the national park names and URLs I would have to look for either a table or an unordered list.
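As a minimal sketch of this idea, the snippet below collects the text and URL of every link that sits inside a table or an unordered list, so it also covers pages with several of either. It uses Python's standard `html.parser` module, which is an assumption made to keep the example dependency-free; the project itself may have used a different library. `LinkCollector`, `extract_links` and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for links inside <table> or <ul> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # number of <table>/<ul> elements we are inside
        self.href = None    # href of the link currently being read, if any
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag in ("table", "ul"):
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag in ("table", "ul"):
            self.depth = max(0, self.depth - 1)
        elif tag == "a" and self.href is not None:
            self.links.append(("".join(self.text).strip(), self.href))
            self.href = None


def extract_links(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page: one link outside any list, one list, one table.
SAMPLE_HTML = """
<p><a href="/wiki/Elsewhere">Not in a list or table</a></p>
<ul>
  <li><a href="/wiki/Park_A">Park A</a></li>
  <li><a href="/wiki/Park_B">Park B</a></li>
</ul>
<table>
  <tr><td><a href="/wiki/Park_C">Park C</a></td></tr>
</table>
"""

print(extract_links(SAMPLE_HTML))
```

The link in the opening paragraph is skipped because it is not inside a table or an unordered list.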
Narrowing the Search - The ‘National Park’ Heading
An HTML heading is an HTML element that represents a title that is displayed on the webpage. A common country webpage structure that stuck out was the presence of an HTML heading with the text ‘national park’ in it. Underneath this heading, there was usually a table or unordered list of that country’s national parks. To get the national park names and URLs, I attempted to first look for the ‘national park’ heading, and then look for the first list or table in the HTML code that followed the heading.
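The heading-first search can be sketched as follows, again with the standard `html.parser` module as an assumption. The parser waits for a heading whose text contains "national park", then reads links only from the first table or unordered list that follows it. `FirstContainerAfterHeading`, `links_after_heading` and the sample HTML are hypothetical names and data.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
CONTAINERS = {"table", "ul"}


class FirstContainerAfterHeading(HTMLParser):
    """Read links from the first <table>/<ul> after a matching heading."""

    def __init__(self, keyword="national park"):
        super().__init__()
        self.keyword = keyword
        self.in_heading = False
        self.heading_text = []
        self.heading_seen = False
        self.container = None   # tag name of the container being read
        self.depth = 0          # nesting depth of that container tag
        self.done = False
        self.href = None
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag in HEADINGS and self.container is None:
            self.in_heading = True
            self.heading_text = []
        elif self.container is None and self.heading_seen and tag in CONTAINERS:
            self.container = tag
            self.depth = 1
        elif self.container is not None:
            if tag == self.container:
                self.depth += 1
            elif tag == "a":
                self.href = dict(attrs).get("href")
                self.text = []

    def handle_data(self, data):
        if self.in_heading:
            self.heading_text.append(data)
        elif self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag in HEADINGS and self.in_heading:
            self.in_heading = False
            if self.keyword in "".join(self.heading_text).lower():
                self.heading_seen = True
        elif self.container is not None:
            if tag == "a" and self.href is not None:
                self.links.append(("".join(self.text).strip(), self.href))
                self.href = None
            elif tag == self.container:
                self.depth -= 1
                if self.depth == 0:
                    self.done = True  # stop after the first container


def links_after_heading(html, keyword="national park"):
    parser = FirstContainerAfterHeading(keyword)
    parser.feed(html)
    parser.close()
    return parser.links


# Made-up sample page: a navigation list, the heading, a table, a trailing list.
SAMPLE_HTML = """
<ul><li><a href="/wiki/Main_Page">Main page</a></li></ul>
<h2>National parks</h2>
<p>Intro text.</p>
<table>
  <tr><td><a href="/wiki/Park_A">Park A</a></td></tr>
  <tr><td><a href="/wiki/Park_B">Park B</a></td></tr>
</table>
<ul><li><a href="/wiki/See_also">See also</a></li></ul>
"""

print(links_after_heading(SAMPLE_HTML))
```

Both the navigation list before the heading and the list after the table are ignored; only the table directly following the heading is read.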
Looking for Multiple Tables or Unordered Lists
The above strategy did not account for webpages that had multiple tables or unordered lists. Additional logic was included to check the webpage for additional tables or unordered lists once either was found.
No ‘National Park’ Heading
Occasionally, a country’s webpage did not have a national park heading. In these cases, I added code to return the first table in the webpage and, if no table could be found, the first unordered list. When I was exploring different webpages at the beginning, I noticed that many contained multiple unordered lists, only one of which held the names and URLs of national parks, and it was often not the first. In contrast, the first table on these webpages was often the one I needed. By checking for a table first, I reasoned that I could reduce the chances of picking up an unordered list that did not have the information I was looking for.
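The table-first fallback could look something like the sketch below. It is written with the standard `html.parser` module as an assumption, and `FirstElementLinks`, `first_element_links` and `fallback_links` are hypothetical names.

```python
from html.parser import HTMLParser


class FirstElementLinks(HTMLParser):
    """Collect (text, href) pairs from the first element with a given tag."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.found = False
        self.depth = 0
        self.done = False
        self.href = None
        self.text = []
        self.links = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == self.target:
            self.found = True
            self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.href = dict(attrs).get("href")
            self.text = []

    def handle_data(self, data):
        if self.href is not None:
            self.text.append(data)

    def handle_endtag(self, tag):
        if self.done:
            return
        if tag == "a" and self.href is not None:
            self.links.append(("".join(self.text).strip(), self.href))
            self.href = None
        elif tag == self.target and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True


def first_element_links(html, target):
    """Return the links in the first <target> element, or None if there is none."""
    parser = FirstElementLinks(target)
    parser.feed(html)
    parser.close()
    return parser.links if parser.found else None


def fallback_links(html):
    """Prefer the first table; fall back to the first unordered list."""
    links = first_element_links(html, "table")
    if links is None:
        links = first_element_links(html, "ul")
    return links or []
```

Returning `None` when no element exists, rather than an empty list, keeps "no table on the page" distinct from "a table with no links", so the fallback to a list only happens in the first case.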
Groups of Irregular Webpage Structures
Using the strategies outlined above, I was able to get the coordinates for most of the national parks. However, there were still some national parks that were not being accounted for. Looking at the countries whose national parks were still missing coordinates, I noticed some patterns and wrote additional code to account for them. Below is a short outline of the common patterns that stood out.
There was no national park heading, but there was more than one table with national parks on the webpage.
The unordered list(s) contained all protected areas in addition to national parks.
A national park heading was found, but the table with the national park names and URLs was before the heading instead of after.
The country only had one national park, and the country URL redirected to the webpage for that national park rather than to a country webpage containing a table or list of that country’s national parks.
Missing National Parks
Despite accounting for irregular website structures, there were still some national parks that slipped through the cracks. There are some countries that have unique tables or unordered lists that do not share patterns with any other countries. I will have to write additional code to account for these unique edge cases.
National parks were also missing because of an invalid URL somewhere along the web scraping pipeline. If a country did not have a valid Wikipedia page, it was not possible to get the coordinates for national parks in that country; 90 national parks are without coordinates for this reason. Likewise, if a national park did not have a valid Wikipedia page, it was not possible to get its coordinates; 333 national parks are without coordinates for this reason. Finally, some national park webpages simply did not contain coordinates, which accounts for another 37 national parks.
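A park webpage without coordinates can be detected when the page is parsed. The sketch below assumes the coordinates appear in a `span` element with the class `geo`, written as `latitude; longitude` in decimal degrees. That markup is an assumption about how the pages are structured and should be checked against the real HTML; `GEO_PATTERN` and `find_coordinates` are hypothetical names.

```python
import re

# Assumed markup: <span class="geo">51.5; -116.0</span>
GEO_PATTERN = re.compile(
    r'<span class="geo">\s*(-?\d+(?:\.\d+)?)\s*;\s*(-?\d+(?:\.\d+)?)\s*</span>'
)


def find_coordinates(html):
    """Return (latitude, longitude) as floats, or None if none are found."""
    match = GEO_PATTERN.search(html)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(find_coordinates('<p><span class="geo">51.5; -116.0</span></p>'))
print(find_coordinates("<p>No coordinates on this page.</p>"))
```

Returning `None` instead of raising an error lets the scraper record the park as missing coordinates and move on to the next one.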
Too Many Parks?
Some countries ended up with more national parks scraped than were listed on the main Wikipedia page. Upon investigation, it appeared that some country webpages listed parks that were designated as national parks under the country’s own definition rather than the IUCN definition. Other webpages listed decommissioned national parks. The scraper did not account for either scenario. There were also country webpages that listed other protected areas, such as conservation areas, that did not have the national park designation. While an attempt was made to filter out the non-national parks, it was not always successful, and a handful of records in the dataset do not fall under the IUCN definition of a national park.
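The text does not say how the filtering was done, so the following is only one possible approach, stated as an assumption: keep a link only if its text contains the phrase "national park". `looks_like_national_park` and `filter_parks` are hypothetical names.

```python
def looks_like_national_park(name, keyword="national park"):
    """Name-based check (an assumption, not the project's actual rule)."""
    return keyword in name.lower()


def filter_parks(links):
    """Keep only (name, href) pairs whose name passes the check."""
    return [(name, href) for name, href in links if looks_like_national_park(name)]


# Made-up sample links for illustration.
sample_links = [
    ("Example National Park", "/wiki/Example_National_Park"),
    ("Example Conservation Area", "/wiki/Example_Conservation_Area"),
]
print(filter_parks(sample_links))
```

A name-based check like this is easy to fool. It would drop national parks whose names do not contain the phrase, and it would keep decommissioned parks or parks that only meet a national definition, which is consistent with the filtering not always being successful.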
Visualizing the Data
Using the strategy outlined above, I was able to scrape 2,836 out of the 3,247 national parks, or roughly 87% of available national parks. This figure is a slight overestimate because the scraper also captured decommissioned national parks and national parks that were defined by a country’s national definition rather than the IUCN definition.
With the data I was able to scrape, I used Tableau to create a choropleth map showing which countries had the most national parks. The map also displays the location of each national park that was scraped. You can view the visualization below or, if you want to see it in fullscreen, click here.
Key Takeaways
Ideally, the data we want would be available as files we could download, a database we could query, or an API we could call. When the data is only available on webpages, we can turn to web scraping as a means to get the data we are looking for.
When you are scraping multiple webpages to collect data and they all differ slightly in terms of the underlying HTML code, you are presented with an interesting (and at times, painful) challenge. It was fun to write code that accounted for different webpage layouts, but this project also highlighted how fragile web scraping can be. Slight differences in webpage layout can cause an error if you do not account for them in your code. I am grateful that the majority of webpages had a similar enough layout, and there were only a few edge cases that needed to be accounted for to gather the data I was looking for.