BucketList App: Getting the Data (part 1)

This article is part of the series on building a Bucket List App using NYT’s “places to go” lists paired with open source data from Wikivoyage and Pixabay.

This section will describe how to get the data that will be used in the application. If you plan on making an app using open source data, you might find some of these techniques useful.

Copy NYT Lists into Google Sheets

The first step was to copy the lists into a Google Sheet. I used the IMPORTXML function in Google Sheets to quickly extract the lists of places. IMPORTXML is a must-have simple data-scraping tool, and Distilled.net has a great tutorial on how to use it.

Here’s a quick example of how I did it for 2011:

=IMPORTXML("https://www.nytimes.com/2011/01/09/travel/09where-to-go.html", "//strong")
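The same extraction can be scripted outside of Sheets. Here is a minimal Python sketch (standard library only, not the exact pipeline I used) that mirrors the //strong XPath above by collecting the text of every strong tag from the article's HTML:

```python
from html.parser import HTMLParser

class StrongExtractor(HTMLParser):
    """Collect the text inside every <strong> tag (mirrors the //strong XPath)."""
    def __init__(self):
        super().__init__()
        self.depth = 0        # nesting level inside <strong> tags
        self.places = []      # extracted place names

    def handle_starttag(self, tag, attrs):
        if tag == "strong":
            self.depth += 1
            self.places.append("")

    def handle_endtag(self, tag):
        if tag == "strong" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            # Append text to the most recently opened <strong>
            self.places[-1] += data

def extract_strong_text(html: str) -> list:
    """Return the stripped text of each <strong> element in the page."""
    parser = StrongExtractor()
    parser.feed(html)
    return [p.strip() for p in parser.places if p.strip()]
```

Feed it the page HTML (fetched with any HTTP client) and you get the same list of place names that IMPORTXML returns.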

Get Pixabay Images for Each Location

Next, I needed to get images of the places to use in the application. The New York Times lists have beautiful accompanying images and embedded videos, but we can’t use those. Thankfully, Pixabay, an open source image repository, has a straightforward API. I queried the Pixabay API using the IMPORTJSON custom function (added to the sheet via Apps Script) directly within Google Sheets.

=TRANSPOSE(IMPORTJSON(CONCATENATE("https://pixabay.com/api/?key=3111109-acd0034c9055b6da0366ff7c1&q=",A2,"&image_type=photo&pretty=true"),"/hits/webformatURL","noHeaders"))

This gave me a list of images matching each query (e.g. “Iceland”, “London”). However, the Pixabay image links expire a couple of hours later. Their docs don’t say why, but I imagine it is to prevent hotlinking: they don’t want people hammering their servers by embedding Pixabay-hosted images all over the web. You are meant to download the images and host them yourself, hence the expiring links.
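The same query can be scripted in Python. This sketch builds the search URL the Sheets formula concatenates and pulls /hits/webformatURL out of the parsed response (the API key below is a placeholder; substitute your own):

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PIXABAY_KEY = "YOUR_API_KEY"  # placeholder; use your own key

def pixabay_search_url(query: str, key: str = PIXABAY_KEY) -> str:
    """Build the same search URL the Sheets formula concatenates."""
    params = {"key": key, "q": query, "image_type": "photo", "pretty": "true"}
    return "https://pixabay.com/api/?" + urlencode(params)

def extract_image_urls(response: dict) -> list:
    """Pull /hits/webformatURL out of a parsed Pixabay API response."""
    return [hit["webformatURL"] for hit in response.get("hits", [])]

# Usage (makes a network call):
# with urlopen(pixabay_search_url("Iceland")) as resp:
#     print(extract_image_urls(json.load(resp)))
```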

Host Pixabay images on Cloudinary

Pixabay’s expiring image links presented a challenge. I could have written a script to download each image using something like requests and then upload it to S3 or Google Cloud, but I wanted something simpler. With a bit of searching, I discovered a service called Cloudinary that automatically fetches and hosts any image whose URL is prefixed with their service. I set up the auto-upload feature within Cloudinary (as simple as mapping a folder) and then changed the Pixabay URLs to Cloudinary URLs:

I changed this:

https://pixabay.com/get/e832b0062ff3033ed1584d05fb1d4597ea74e2dc04b014429cf4c67aa2ecb0_640.jpg

To this:

https://res.cloudinary.com/kpennell/image/upload/pixabay/e832b0062ff3033ed1584d05fb1d4597ea74e2dc04b014429cf4c67aa2ecb0_640.jpg

I had to open each of these Cloudinary URLs in a new tab to trigger the upload (I could also have used an HTTP client to do this). And now I had 2,000+ images of my travel destinations hosted on Cloudinary.
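The URL rewrite itself is a simple prefix swap, so it's easy to script. This sketch (using my Cloudinary folder mapping from above) keeps the Pixabay filename and swaps the host/path prefix; requesting each rewritten URL once is what triggers Cloudinary's auto-upload:

```python
CLOUDINARY_PREFIX = "https://res.cloudinary.com/kpennell/image/upload/pixabay/"

def to_cloudinary(pixabay_url: str) -> str:
    """Rewrite a Pixabay delivery URL to its Cloudinary auto-upload
    equivalent: keep the image filename, swap the host/path prefix."""
    filename = pixabay_url.rsplit("/", 1)[-1]
    return CLOUDINARY_PREFIX + filename

# Requesting each rewritten URL once (e.g. with urllib.request.urlopen)
# makes Cloudinary fetch and permanently host the image.
```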

Get WikiVoyage Data

This bucket list travel application will feature a list of images with accompanying descriptions of the places and links to more info. Wikivoyage is an excellent open source travel guide that I can use for this. Wikivoyage is owned/maintained by Wikimedia and has a great API. As much as I also like Wikitravel, they didn’t have a straightforward way to get their data via API search. Searching the Wikivoyage API was fairly easy:

First, I needed to run a search to get what the API calls pageids for the different locations/pages.

A search like this returns those pageids along with snippets from each Wikivoyage page (i.e. the first sentence of the guide):

Here’s an example:

https://en.wikivoyage.org/w/api.php?action=query&list=search&srwhat=text&srsearch=Andermatt,%20Switzerland&format=json&prop=info

This gives me the top pageid of “1127” for Andermatt, Switzerland (featured in the 2014 list). It also gives me a snippet that reads “<span class="searchmatch">Andermatt</span> is a town and ski resort in the canton of Uri, <span class="searchmatch">Switzerland</span>. It is located at a crossroad in the Alps with great historical importance.”

Next, I needed to take the top pageid (which is right about 95% of the time) and get the URL for the page (so users can visit it to learn more). This was done with the call below for Andermatt (aka pageid 1127):

https://en.wikivoyage.org/w/api.php?action=query&prop=info&pageids=1127&inprop=url&format=json
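The two-step flow (search for a pageid, then fetch that page's URL) can be sketched in Python. The helpers below just build the same API URLs shown above and pick the top search hit out of a parsed response:

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

API = "https://en.wikivoyage.org/w/api.php"

def search_url(term: str) -> str:
    """Step 1: search for a place to find its pageid and snippet."""
    return API + "?" + urlencode({
        "action": "query", "list": "search", "srwhat": "text",
        "srsearch": term, "format": "json",
    })

def page_info_url(pageid: int) -> str:
    """Step 2: fetch the canonical URL for that pageid."""
    return API + "?" + urlencode({
        "action": "query", "prop": "info", "pageids": pageid,
        "inprop": "url", "format": "json",
    })

def top_pageid(search_response: dict) -> Optional[int]:
    """Take the first search hit's pageid (right about 95% of the time)."""
    hits = search_response.get("query", {}).get("search", [])
    return hits[0]["pageid"] if hits else None

# Usage (makes network calls):
# with urlopen(search_url("Andermatt, Switzerland")) as resp:
#     pid = top_pageid(json.load(resp))   # 1127 for Andermatt
```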

I imported and parsed all of the results straight into Google Sheets, again, using importjson.

Geocode the Data

Finally, I needed latitude and longitude for each location so I could put them on an Airbnb-like map interface. I described how to geocode easily using Google’s geocoder here.
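For reference, here is a minimal sketch of the shape of such a geocoding call against Google's Geocoding API (the key is a placeholder, and this is an illustration rather than the exact approach from that article):

```python
from urllib.parse import urlencode

GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"

def geocode_url(address: str, key: str = "YOUR_API_KEY") -> str:
    """Build a Google Geocoding API request URL for an address string."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": key})

def extract_latlng(response: dict):
    """Pull (lat, lng) from the first result of a Geocoding API response."""
    results = response.get("results", [])
    if not results:
        return None
    loc = results[0]["geometry"]["location"]
    return loc["lat"], loc["lng"]
```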

Final Data

My final data can be found (and forked!) here: https://docs.google.com/spreadsheets/d/1oOJQttPMHope_hkQ2RC7Epzr5ZmORtJEs2-Zo5mURYw/edit?usp=sharing

This is what it ended up looking like:

Now I’m ready to build an application that can display, filter, and otherwise consume this awesome travel data!

 

Kyle Pennell
Kyle Pennell is a marketer and writer with a variety of technical chops (JavaScript, Google Sheets/Excel, WordPress, SEO). He loves clear and helpful copy and likes automating things and making web apps (mostly maps). Found at kyleapennall.com and many other digital spaces.