Recreate the scraping script as a shell script

This commit removes the old 'scrap_pokemon.py' script in favor of
the new 'scrap_data.sh' script. Now we don't need Python anymore!
Lucas Possatti
2015-06-24 16:56:42 -03:00
parent 640e61366b
commit c6c3e1e9c2
2 changed files with 53 additions and 33 deletions

scrap_data.sh Executable file

@@ -0,0 +1,53 @@
#!/bin/sh
#
# This script scrapes some Pokémon pictures from Bulbapedia.
#
bulbapedia_page_url="http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_Kanto_Pok%C3%A9dex_number"
bulbapedia_page_name="bulbapedia.html"
scrap_folder="$(pwd)/scrapped-data"
# Make sure the directory for the scrapped data is there.
mkdir -p "$scrap_folder"
# Download the Bulbapedia page if it doesn't already exist.
if [ ! -e "$scrap_folder/$bulbapedia_page_name" ]; then
echo " > Downloading '$bulbapedia_page_url' to '$scrap_folder/$bulbapedia_page_name' ..."
wget "$bulbapedia_page_url" -O "$scrap_folder/$bulbapedia_page_name" -q
echo " > Downloaded."
fi
# Dear future me,
#
# If you ever need to maintain this part of the code... I am
# really sorry for you (T.T). This was the best I could do... But
# I will try to explain things here a little bit.
# 'cat' will read the file and pipe its output to 'sed'. 'sed'
# will filter the HTML, searching for each Pokémon name and its
# image URL. 'sed' will output one Pokémon per line, in this
# format: "<POKEMON_NAME>=<POKEMON_URL>".
# Then, the output of 'sed' goes into the while loop, which will
# read the output one line at a time. Within the while loop, I
# extract the Pokémon name and the URL from the line that was
# read. And then, it just downloads the URL to a file.
# Again... I'm sorry for all the trouble. But I hope you will
# grow stronger and may be able to turn this code into something
# more readable.
#
# Kind regards,
# Yourself from the past.
cat "$scrap_folder/$bulbapedia_page_name" | \
sed -nr 's;^.*<img alt="(.*)" src="(http://cdn.bulbagarden.net/upload/.*\.png)" width="40" height="40" />.*$;\1=\2;p' | \
while IFS= read -r line
do
pokemon_name="${line%=*}"
pokemon_url="${line#*=}"
# Unescape HTML characters... Damn "Farfetch&#39;d".
pokemon_name=$(echo "$pokemon_name" | sed "s/&#39;/'/")
echo " > Downloading '$pokemon_name' from '$pokemon_url' to '$scrap_folder/$pokemon_name.png' ..."
wget "$pokemon_url" -O "$scrap_folder/$pokemon_name.png" -q
done
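The comment above describes the `"<POKEMON_NAME>=<POKEMON_URL>"` line format and the parameter expansions that split it. The split and the entity unescape can be sketched in isolation; the sample line below is a made-up stand-in (the URL is a shortened placeholder, not a real Bulbapedia asset):

```shell
#!/bin/sh
# A sample line in the "<POKEMON_NAME>=<POKEMON_URL>" format that the
# sed filter emits. The URL here is a hypothetical stand-in.
line="Farfetch&#39;d=http://cdn.bulbagarden.net/upload/sample.png"

# "${line%=*}" strips the shortest "=..." suffix, leaving the name;
# "${line#*=}" strips the shortest "...=" prefix, leaving the URL.
pokemon_name="${line%=*}"
pokemon_url="${line#*=}"

# Unescape the HTML apostrophe entity, as the script does.
pokemon_name=$(echo "$pokemon_name" | sed "s/&#39;/'/")

echo "$pokemon_name"   # Farfetch'd
echo "$pokemon_url"    # http://cdn.bulbagarden.net/upload/sample.png
```

Since the name part never contains `=` in this data, the shortest-match expansions cleanly split on the single separator.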

scrap_pokemon.py Deleted file

@@ -1,33 +0,0 @@
#!/usr/bin/python3
import re
import sys
import html
import urllib.request
# Load the pokemon sprites page.
with open('scrapped-data/bulbapedia.html', 'r') as page:
    html_page = page.read()
# Find all pokemon name and image urls.
image_regex = r'<img alt="(.*)" src="(http://cdn.bulbagarden.net/upload/.*\.png)" width="40" height="40" />'
all_pokemon = re.findall(image_regex, html_page)
# Save the image of each pokemon.
for pokemon in all_pokemon:
    # Unpack the tuple data.
    name, image_url = pokemon
    # Clean HTML escape sequences in the name.
    name = html.unescape(name)
    # Set file path for the image.
    file_path = './scrapped-data/' + name + '.png'
    # Tell the user what we are doing here.
    print('Downloading "{}" image to "{}"...'.format(name, file_path))
    # Download the image.
    with open(file_path, 'wb') as pokemon_file:
        with urllib.request.urlopen(image_url) as pokemon_image:
            pokemon_file.write(pokemon_image.read())