Recreate the scraping script as a shell script

This commit removes the old 'scrap_pokemon.py' script in favor of
the new 'scrap_data.sh' script. Now we don't need Python anymore!
Lucas Possatti
2015-06-24 16:56:42 -03:00
parent 640e61366b
commit c6c3e1e9c2
2 changed files with 53 additions and 33 deletions

scrap_data.sh Executable file

@@ -0,0 +1,53 @@
#!/bin/sh
#
# This script scrapes some Pokémon pictures from Bulbapedia.
#
bulbapedia_page_url="http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_Kanto_Pok%C3%A9dex_number"
bulbapedia_page_name="bulbapedia.html"
scrap_folder="$(pwd)/scrapped-data"
# Make sure the directory for the scrapped data is there.
mkdir -p "$scrap_folder"
# Download the Bulbapedia page if it doesn't already exist.
if [ ! -e "$scrap_folder/$bulbapedia_page_name" ]; then
echo " > Downloading '$bulbapedia_page_url' to '$scrap_folder/$bulbapedia_page_name' ..."
wget "$bulbapedia_page_url" -O "$scrap_folder/$bulbapedia_page_name" -q
echo " > Downloaded."
fi
# Dear future me,
#
# If you ever need to maintain this part of the code... I am
# really sorry for you (T.T). This was the best I could do... But
# I will try to explain things here a little bit.
# 'cat' will read the file and pipe its output to 'sed'. 'sed'
# will filter the HTML, searching for each Pokémon name and its
# image URL. 'sed' will output one Pokémon per line, in this
# format: "<POKEMON_NAME>=<POKEMON_URL>".
# Then, the output of 'sed' goes into the while loop, which will
# read the output one line at a time. Within the while loop, I
# extract the Pokémon name and the URL from the line that was
# read. And then, it just downloads the URL to a file.
# Again... I'm sorry for all the trouble. But I hope you will
# grow stronger and may be able to turn this code into something
# more readable.
#
# Kind regards,
# Yourself from the past.
cat "$scrap_folder/$bulbapedia_page_name" | \
sed -nr 's;^.*<img alt="(.*)" src="(http://cdn.bulbagarden.net/upload/.*\.png)" width="40" height="40" />.*$;\1=\2;p' | \
while IFS= read -r line
do
pokemon_name="${line%=*}"
pokemon_url="${line#*=}"
# Unescape HTML characters... Damn "Farfetch&#39;d".
pokemon_name=$(echo "$pokemon_name" | sed "s/&#39;/'/")
echo " > Downloading '$pokemon_name' from '$pokemon_url' to '$scrap_folder/$pokemon_name.png' ..."
wget "$pokemon_url" -O "$scrap_folder/$pokemon_name.png" -q
done
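The comment above describes the `"<POKEMON_NAME>=<POKEMON_URL>"` line format and the parameter expansions that split it. The split and the entity unescape can be sketched in isolation; the sample line below is a made-up stand-in (the URL is a shortened placeholder, not a real Bulbapedia asset):

```shell
#!/bin/sh
# A sample line in the "<POKEMON_NAME>=<POKEMON_URL>" format that the
# sed filter emits. The URL here is a hypothetical stand-in.
line="Farfetch&#39;d=http://cdn.bulbagarden.net/upload/sample.png"

# "${line%=*}" strips the shortest "=..." suffix, leaving the name;
# "${line#*=}" strips the shortest "...=" prefix, leaving the URL.
pokemon_name="${line%=*}"
pokemon_url="${line#*=}"

# Unescape the HTML apostrophe entity, as the script does.
pokemon_name=$(echo "$pokemon_name" | sed "s/&#39;/'/")

echo "$pokemon_name"   # Farfetch'd
echo "$pokemon_url"    # http://cdn.bulbagarden.net/upload/sample.png
```

Since the name part never contains `=` in this data, the shortest-match expansions cleanly split on the single separator.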

scrap_pokemon.py Deleted file

@@ -1,33 +0,0 @@
#!/usr/bin/python3
import re
import sys
import html
import urllib.request
# Load the pokemon sprites page.
with open('scrapped-data/bulbapedia.html', 'r') as page:
    html_page = page.read()
# Find all pokemon name and image urls.
image_regex = r'<img alt="(.*)" src="(http://cdn.bulbagarden.net/upload/.*\.png)" width="40" height="40" />'
all_pokemon = re.findall(image_regex, html_page)
# Save the image of each pokemon.
for pokemon in all_pokemon:
    # Unpack the tuple data.
    name, image_url = pokemon
    # Clean HTML escape sequences in the name.
    name = html.unescape(name)
    # Set file path for the image.
    file_path = './scrapped-data/' + name + '.png'
    # Tell the user what we are doing here.
    print('Downloading "{}" image to "{}"...'.format(name, file_path))
    # Download the image.
    with open(file_path, 'wb') as pokemon_file:
        with urllib.request.urlopen(image_url) as pokemon_image:
            pokemon_file.write(pokemon_image.read())