HTML and XML are standard markup formats; HTML in particular is used to display information on webpages. Both consist of readable text files in which the content is mixed with many tags that describe structure and formatting. It is a common task for data miners to “scrape” data from webpages. You can use regular expressions for that, but if the webpages are reasonably well-formatted, the “Beautiful Soup” module may help you out.
The Beautiful Soup module is named bs4 in Python (the 4 is its major version number; Beautiful Soup 3 came before it, and later versions may follow). It contains the BeautifulSoup class, which you can use to load and interpret HTML and XML files. bs4 is not part of the standard Python distribution; you have to install it separately, which is quite a hassle unless you use a tool called pip, which comes standard with Python 3 (version 3.4 and later).
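To give an impression of what working with the BeautifulSoup class looks like, here is a minimal sketch. It assumes that bs4 has been installed (the package is published on PyPI under the name beautifulsoup4). The HTML text, the class names in it, and the example.com link are made up for illustration only.

```python
from bs4 import BeautifulSoup

# A small, made-up HTML document used only for illustration.
html = """
<html>
  <head><title>Fruit prices</title></head>
  <body>
    <h1>Today's prices</h1>
    <ul>
      <li class="fruit">apple: 0.50</li>
      <li class="fruit">banana: 0.25</li>
      <li class="other">bread: 2.00</li>
    </ul>
    <a href="https://example.com/more">More prices</a>
  </body>
</html>
"""

# "html.parser" selects the parser from Python's standard library,
# so nothing besides bs4 itself needs to be installed.
soup = BeautifulSoup(html, "html.parser")

# The text of the <title> element.
title = soup.title.string

# The text of every <li> element that has class "fruit".
fruits = [li.get_text() for li in soup.find_all("li", class_="fruit")]

# The href attribute of the first <a> element.
link = soup.find("a")["href"]

print(title)
print(fruits)
print(link)
```

Note that none of this requires you to describe the layout of the tags yourself: you ask for elements by name or attribute, and the module takes care of the parsing.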
There are alternative modules that can ease the pain of web scraping for you, notably lxml, but Beautiful Soup seems to be the most popular.
Since all such modules require separate installations, I will not discuss them in depth here. I only wish to indicate that if you need to do web scraping (and it is likely that you will have to at some point), you should check out some of the standard tools available for that before you delve into eccentric regular-expression design.
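If you cannot install anything at all, Python's standard library does include a basic HTML parser in the html.parser module. The following is only a rough sketch of how it might be used to collect links without writing a regular expression; the LinkCollector class and the page text are hypothetical names and data invented for this example.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag's attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)


# Made-up page text, for illustration only.
page = '<p>See <a href="a.html">A</a> and <a class="x" href="b.html">B</a>.</p>'

collector = LinkCollector()
collector.feed(page)
print(collector.links)
```

This is more work than the equivalent with Beautiful Soup, but it copes with attributes in any order, which a hand-written regular expression easily gets wrong.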