Python Basics with a Web Scraper
HISTORY
Python, a high-level programming language, was conceived in the late 1980s by Guido van Rossum, a Dutch computer scientist. Its development began as a hobby project, with the aim of creating a successor to the ABC language. Guido wanted a language that was both powerful and easy to read, with a syntax that emphasized code readability and clarity.
In February 1991, Python's first version, 0.9.0, was released. The language's name was inspired by Guido's fondness for Monty Python's Flying Circus, rather than the snake of the same name. Python's design philosophy, known as the "Zen of Python," emphasizes simplicity, readability, and explicitness, which has contributed to its popularity among developers.
Python gained traction steadily throughout the 1990s and early 2000s, buoyed by its simplicity, versatility, and an active community of developers. The release of Python 2.0 in 2000 introduced features like list comprehensions and a garbage collector that can detect reference cycles, further enhancing its appeal.
One of Python's key milestones was the release of Python 3.0 (also known as Python 3000 or Py3k) in 2008. Python 3 introduced backward-incompatible changes to address long-standing issues and inconsistencies, aiming to create a cleaner, more consistent language. Despite initial resistance from some users due to the break in backward compatibility, Python 3 adoption has since grown, helped along by Python 2 reaching its end of life on January 1, 2020.
Today, Python is one of the most widely used programming languages, known for its simplicity, readability, and versatility. It is extensively used in various domains, including web development, data analysis, artificial intelligence, scientific computing, and more. Python's rich ecosystem of libraries and frameworks, combined with its vibrant community, continues to fuel its growth and influence in the world of programming.
SYNTAX
Python, unlike many other languages, does not use a character to terminate each command. This eliminates the syntax error caused by a single forgotten semicolon. Python reads its code line by line, top to bottom. The start of a new line tells the Python interpreter that we are starting a new command. Putting two commands on the same line will result in a syntax error, as will splitting one command across two lines. There are special cases and ways around this, but we won’t be covering them here. For the most part, in Python, a new line means a new command.
If you are familiar with other languages you may notice the lack of grouping characters. For example, in C++ everything in a function is grouped within {curly brackets}. Python uses indentation, via tabs or spaces, to tell the interpreter what is within a function. Either tabs or spaces will work, but it is very important to remain consistent: every line in a block must be indented by the same amount, and Python 3 rejects code that mixes tabs and spaces in a way that makes the indentation ambiguous. Personally, I find it easiest to just use tab.
The code snippet above is the definition of a function. [17] We know this because of the def keyword followed by the function name and parentheses (). Within the parentheses are the parameters we will be feeding the function. Everything indented after the colon (:) is a command within the function. [18 and 19] We start the function by defining local variables and making each one a set with the set() function. Notice that the indentation of all of these commands is the same. [20] The next line starts with a #, telling us that this is a comment and won't be interpreted. Comments are simply there to help the developer with readability.
[25] Further down the code we come to a command that starts with for. This is the start of a for loop. Like a function, a loop has commands that belong to it and, like a function, this is communicated to the interpreter via a colon (:) followed by indented code. Within this for loop is an if/else statement. An if/else is a conditional rather than a loop, but it follows the same rule: the commands that belong to it are indented one level further. Consistent indentation is essential in Python because it is how the developer communicates intent to the interpreter.
As you can start to see, Python's method of grouping makes for very readable source code.
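To make the walkthrough concrete, here is a minimal sketch with the same shape: a def line, two local variables created with set(), a comment, and a for loop containing an if/else. It is my own illustration, not the scraper's actual code, so the bracketed line numbers do not apply to it, and the names sort_links, links and domain are hypothetical.

```python
def sort_links(links, domain):
    # Two local variables, each created as an empty set with set()
    internal = set()
    external = set()
    # This line is a comment, so the interpreter skips it
    for link in links:
        # Everything indented under "for" runs once per link
        if domain in link:
            internal.add(link)
        else:
            external.add(link)
    # Back at the function's indentation level: the loop has ended
    return internal, external
```

Each extra level of indentation marks code that belongs to the line above it that ends with a colon.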
Python Web Scraper
Now that I have covered some basics of Python syntax, let's look at the web scraper. This scraper was written to create a CSV file of emails found on a target website. We start with a shebang and then import the modules we will be using. The shebang tells the computer that the following code needs to be executed using the python3 interpreter, and the modules provide the functions we will be using in the web scraper.
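As a rough sketch, the top of a script like this could look as follows. The exact modules the scraper imports are not listed in this post, so the set below is an assumption that sticks to Python's standard library.

```python
#!/usr/bin/env python3
# The shebang above asks the operating system to run this file with python3.

import csv                            # writing the results to a CSV file
import re                             # regular expressions for spotting emails
import urllib.request                 # downloading web pages
from html.parser import HTMLParser    # reading the links out of the HTML
```

The shebang only has an effect on Unix-like systems when the file is run directly; on other systems it is treated as an ordinary comment.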
Next we will define the user functions used in our program. Functions are groups of code that take in data, do something with it and produce an effect or return data. When the program runs these functions will be saved in memory ready for use when called on. You can think of this like getting your tools out and set up ready for work.
This function takes in a URL and creates a list of all the links on the page that URL points to. This URL would be the homepage, and the list of links would be any pages that can be accessed from the homepage.
This function takes in a list of strings, which are email addresses that still have text from the website attached to them. The function then takes these email addresses, removes anything after .com, and creates a new list of cleaned emails.
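A minimal sketch of a link-collecting function is shown here, assuming only the standard library. The names LinkCollector, extract_links and get_links are my own for illustration and may not match the scraper, which could just as well use a third-party HTML parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag it is fed."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_links(html, base_url):
    """Return the absolute http(s) links in html, in order, without duplicates."""
    collector = LinkCollector()
    collector.feed(html)
    links = []
    for href in collector.hrefs:
        # Turn relative links such as "/about" into full URLs
        absolute = urljoin(base_url, href)
        if absolute.startswith(("http://", "https://")) and absolute not in links:
            links.append(absolute)
    return links


def get_links(url):
    """Download the page at url and list the links found on it."""
    with urlopen(url, timeout=10) as response:
        html = response.read().decode("utf-8", errors="replace")
    return extract_links(html, url)
```

Keeping the download in get_links and the parsing in extract_links means the parsing can be tried on a plain string without touching the network.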
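The cleaning step can be sketched in a few lines. The function name clean_emails is hypothetical, and leaving addresses without .com untouched is my assumption, since the post only describes the .com case.

```python
def clean_emails(raw_emails):
    """Cut each string off right after the first ".com" it contains."""
    cleaned = []
    for raw in raw_emails:
        end = raw.find(".com")
        if end == -1:
            # Assumption: no ".com" found, so keep the string as it is
            cleaned.append(raw)
        else:
            cleaned.append(raw[: end + len(".com")])
    return cleaned
```

This is deliberately crude: an address on a domain like mail.company.com would be cut at the first ".com" it contains, so a stricter rule would be needed for addresses like that.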
This function takes the list of URLs from the first function and creates a list of the emails found on those pages. It then uses the second function to clean any leftover text from the end of the emails.
This function is used to write the list of emails to a CSV file. The program can be run on multiple websites and have the results sent to the same CSV file. It simply adds the new results onto the end of the file, resulting in a larger file of emails.
The last part of the program is its main body. This portion is what asks for input from the user and orchestrates the use of the above functions to provide the desired result. Many of these groups of code could be put in their own functions to simplify this portion of the code. This code is by no means an example of best practices, just an example of how powerful a crude bit of code can be.
This section of the code asks the user for some input. This input consists of the target website and whether the program should look for emails on just the first page or also on any page linked from that page. The name of the final CSV file is also entered by the user here. The program then uses this information to decide how and when to use the above functions to produce the desired file containing the list of emails. If no emails are found, the program will tell you that no emails were found and exit.
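Here is one way the email-collecting step could be sketched, again using only the standard library. The regular expression, the helper names, and the fetch parameter are all my assumptions; the pattern is intentionally loose, so it picks up text glued to the end of an address, which is why the trimming step follows it.

```python
import re
from urllib.request import urlopen

# Loose pattern: also matches letters stuck directly after the address
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]+")


def fetch_page(url):
    """Download a page and return its text."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def trim_after_dot_com(address):
    """Drop anything that follows the first ".com" in the address."""
    end = address.find(".com")
    return address if end == -1 else address[: end + len(".com")]


def collect_emails(urls, fetch=fetch_page):
    """Return the cleaned emails found on the given pages, without duplicates."""
    emails = []
    for url in urls:
        try:
            text = fetch(url)
        except OSError:
            # Skip pages that fail to download
            continue
        for match in EMAIL_PATTERN.findall(text):
            email = trim_after_dot_com(match)
            if email not in emails:
                emails.append(email)
    return emails
```

The fetch parameter is there so the function can be tried against canned text instead of a live website.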
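Appending to a CSV file can be sketched with the standard csv module. The function name and the one-email-per-row layout are assumptions on my part; the key detail is opening the file in "a" (append) mode so earlier results are kept.

```python
import csv


def append_emails_to_csv(emails, filename):
    """Add one row per email to the end of the CSV file, creating it if needed."""
    # Mode "a" appends instead of overwriting; newline="" is what the csv
    # module expects when it writes to a file
    with open(filename, "a", newline="", encoding="utf-8") as csv_file:
        writer = csv.writer(csv_file)
        for email in emails:
            writer.writerow([email])
```

Calling the function twice with the same filename leaves the rows from both calls in the file, in the order they were written.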
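The flow described above could be sketched like this. It is an outline under assumptions rather than the scraper's real main body: the prompts and all names are hypothetical, and the three helper functions are passed in as parameters so the sketch stands on its own.

```python
def parse_yes_no(answer):
    """Treat "y" or "yes" (in any case) as True and anything else as False."""
    return answer.strip().lower() in ("y", "yes")


def choose_pages(target_url, follow_links, get_links):
    """Return the pages to search: the target alone, or the target plus its links."""
    pages = [target_url]
    if follow_links:
        for link in get_links(target_url):
            if link not in pages:
                pages.append(link)
    return pages


def main(get_links, collect_emails, write_csv):
    """Ask the user what to do, then run the helper functions in order."""
    target_url = input("Target website URL: ").strip()
    follow_links = parse_yes_no(input("Also search linked pages? (y/n): "))
    csv_name = input("Name of the CSV file: ").strip()

    pages = choose_pages(target_url, follow_links, get_links)
    emails = collect_emails(pages)
    if not emails:
        print("No emails were found.")
        return
    write_csv(emails, csv_name)
    print(f"Wrote {len(emails)} emails to {csv_name}")
```

Splitting the decision about which pages to search into choose_pages keeps the part that talks to the user separate from the part that can be checked without typing anything.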
Security
As you can see, it is quite simple to write a program that will crawl across a website and record all the emails found on it. The same can be done for all kinds of information, including phone numbers and mailing addresses. Providing your email on your personal or small business website allows potential customers to contact you, but having your email listed publicly also opens you up to all kinds of email-based threats, from simple spam to phishing attacks. For this reason, think carefully about what email or phone number you post publicly on your website or, better yet, use a contact form that hides your email from public view. This can save you time and keep messages from potential customers from being buried under a mountain of automated spam.
That’s all for now!
Stay safe online and happy coding!