Beautiful Soup is a Python HTML/XML parser designed for quick-turnaround projects like screen-scraping. I’m finding it extremely useful as an aid for turning static HTML sites into dynamic, database-driven sites: for example, scraping the desired HTML data, dumping it into a CSV file, and importing it into MySQL.
In this particular example, we have a bunch of headers, some of which are followed by an unordered list full of links. I wanted to:
- Identify if a list was preceded by a header.
- If we found one, scrape each link’s URL and text, identify whether it is a local PDF file or an external link, and, if it’s external, grab the page title (for use in the link’s title text).
- If the link returned a 404 error, make a note of it, but don’t put it into the…
- CSV file, which will be created from the Python dictionary built in the previous steps.
- I also wrote a function that strips out newline characters, which is handy.
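The steps above can be sketched roughly as follows. This is a minimal sketch, not the original script: it assumes the `bs4` package (Beautiful Soup 4) is installed and that the headers are `h3` elements, and the helper names (`strip_newlines`, `classify_link`, `links_by_header`, `fetch_title`) as well as the sample HTML are hypothetical.

```python
from urllib.parse import urlparse
from urllib.request import urlopen

from bs4 import BeautifulSoup  # assumption: Beautiful Soup 4 is installed


def strip_newlines(text):
    """Collapse newlines and runs of whitespace into single spaces."""
    return " ".join(text.split())


def classify_link(href):
    """Label a link as an external link, a local PDF, or something else."""
    parsed = urlparse(href)
    if parsed.scheme in ("http", "https"):
        return "external"
    if parsed.path.lower().endswith(".pdf"):
        return "local_pdf"
    return "other"


def links_by_header(html, header_tag="h3"):
    """Map each header's text to a list of (href, link text, kind) tuples.

    Only headers whose next sibling element is a <ul> are included.
    The default header_tag is an assumption about the page's markup.
    """
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for header in soup.find_all(header_tag):
        # Passing True matches any tag, so whitespace text nodes are skipped.
        sibling = header.find_next_sibling(True)
        if sibling is None or sibling.name != "ul":
            continue
        key = strip_newlines(header.get_text())
        result[key] = [
            (a["href"], strip_newlines(a.get_text()), classify_link(a["href"]))
            for a in sibling.find_all("a", href=True)
        ]
    return result


def fetch_title(url, timeout=10):
    """Return the page's <title> text, or None if the request fails.

    HTTP errors (such as a 404), URL errors, and timeouts are all
    OSError subclasses, so one except clause covers them.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            soup = BeautifulSoup(response.read(), "html.parser")
    except OSError:
        return None
    if soup.title is None or soup.title.string is None:
        return None
    return strip_newlines(soup.title.string)


# Hypothetical sample markup, standing in for the real page.
sample = """
<h3>magazine articles</h3>
<ul>
  <li><a href="files/article.pdf">An
  article</a></li>
  <li><a href="http://example.com/">Example</a></li>
</ul>
<h3>no list here</h3>
<p>Just a paragraph.</p>
"""
links = links_by_header(sample)
```

With this sample, `links` has a single key, `magazine articles`, holding two tuples: one classified as `local_pdf` and one as `external`. The second header is skipped because no list follows it. `fetch_title` is not called here since it needs network access; a `None` result is the signal to note the link as broken and leave it out of the CSV.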
Here’s the raw HTML that I’m dealing with (excuse the broken images, we don’t need them here anyways).
You can see we have 3 link categories (based on the headers):
- magazine articles
- web and newsletter writing
- promotion tools
The script I wrote creates a dictionary with those values as keys, then builds a list of tuples for the corresponding links under each key. In the final stage, it outputs all valid data to a CSV file. (Note that the CSV file will be created in the same directory that the script was run from.) And don’t forget, if you run it from the command line like so, you’ll get an interactive prompt that lets you experiment with the data that was generated:
python -i ogm-samples.py
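As a rough illustration of the final CSV step described above, here is a minimal sketch using Python’s standard `csv` module. The function name `write_links_csv`, the column layout, the output filename, and the sample data are all hypothetical; the real script’s tuples may carry different fields.

```python
import csv


def write_links_csv(links, path):
    """Write one CSV row per link: category, URL, link text, title."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["category", "url", "text", "title"])
        for category, entries in links.items():
            for url, text, title in entries:
                writer.writerow([category, url, text, title])


# Hypothetical data: header text as keys, lists of tuples as values.
links = {
    "magazine articles": [("files/article.pdf", "An article", "")],
    "promotion tools": [("http://example.com/", "Example", "Example Domain")],
}

# A relative path, so the file lands in the directory the script is run from.
write_links_csv(links, "ogm-samples.csv")
```

Writing one row per link, with the category repeated in the first column, keeps the file flat, which should make a later import into a single MySQL table straightforward.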
I hope you find it useful! Please leave any comments, suggestions, bugs, and/or improvements below – cheers.