unsubscribe_html

Dependencies

This Python script has the following dependencies:

  • imapclient: library for accessing Outlook inbox via IMAP
  • email: Python standard library for working with emails
  • os: Python standard library for interacting with the operating system
  • dotenv (installed from the python-dotenv package): library for loading environment variables from a .env file
  • bs4 (BeautifulSoup): library for extracting information from HTML pages
  • urllib: Python standard library for handling URLs

Environment Variables

This script uses the following environment variables:

  • EMAIL_ADDRESS: email address of the Outlook account to be used
  • EMAIL_PASSWORD: password of the Outlook account to be used
  • IMAP_SERVER: address of the Outlook IMAP server
  • IMAP_SSL: indicates whether to use an SSL/TLS connection to connect to the IMAP server (True) or not (False).
  • UNSEEN (optional): indicates whether to search only for unread emails (True) or all emails (False). If this variable is not defined or its value is not "True", all emails will be searched.
  • FOLDERS_EMAIL: comma-separated list of folders to be checked for emails containing the unsubscribe links.
  • KEYWORDS_FILE: path to the text file containing the keywords to be searched for in the text of the email links. Each keyword should be on a separate line.

The environment variables are loaded from a .env file in the root of the project.
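Values read from the environment are always strings, so flags such as IMAP_SSL and UNSEEN and the FOLDERS_EMAIL list need to be converted before use. The following is a minimal sketch of that conversion, not part of the original script; the helper names env_flag and env_list are hypothetical.

```python
import os


def env_flag(name, default=False):
    """Return True only when the variable is set to "true" (case-insensitive)."""
    value = os.getenv(name)
    if value is None:
        return default
    return value.strip().lower() == "true"


def env_list(name):
    """Split a comma-separated variable into trimmed, non-empty items."""
    raw = os.getenv(name, "")
    return [item.strip() for item in raw.split(",") if item.strip()]
```

With IMAP_SSL=False in the .env file, env_flag("IMAP_SSL", default=True) returns False, whereas passing the raw string "False" to a boolean parameter would be treated as true.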

How it works

Searching for emails

The script connects to the Outlook account and goes through the folders indicated in FOLDERS_EMAIL, searching for unread emails (if UNSEEN is "True") or all emails (otherwise).

For each email found, the script checks whether the email has HTML content. If it does, it extracts the links contained in the HTML using the BeautifulSoup library. Then, it checks whether the visible text of any of these links contains any of the keywords listed in the file indicated in KEYWORDS_FILE. The comparison is case-insensitive.
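The two steps described above (finding the HTML body, then matching keywords against link text) can be sketched in isolation. To keep the example free of third-party dependencies, this sketch uses the standard library html.parser instead of BeautifulSoup; the names find_html_body, LinkCollector, and matching_links are hypothetical and do not appear in the script.

```python
from email.message import EmailMessage
from html.parser import HTMLParser


def find_html_body(msg):
    """Return the first text/html part of a message as a string, or None."""
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return None


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a href="..."> element."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a> element
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def matching_links(html, keywords):
    """Return the hrefs whose link text contains any keyword (case-insensitive)."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    return [
        href
        for href, text in collector.links
        if any(keyword in text.lower() for keyword in keywords)
    ]
```

Because find_html_body returns None when a message has no HTML part, the caller can skip that message explicitly instead of reusing content from an earlier one.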

If a link's text contains one of the keywords and the link has not been stored yet, the script stores the email information (sender and subject) and the link in a dictionary and adds that dictionary to a list.
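Sender and subject headers are often MIME-encoded, so they have to be decoded before they are stored. The sketch below shows one way to decode a header and to append a record only when its link is new; decode_mime_header and add_link are hypothetical helper names, and falling back to UTF-8 for unknown charsets is an assumption of this example.

```python
from email.header import decode_header


def decode_mime_header(value):
    """Decode an RFC 2047 encoded header value into a plain string."""
    if value is None:
        return ""
    pieces = []
    for part, encoding in decode_header(value):
        if isinstance(part, bytes):
            try:
                pieces.append(part.decode(encoding or "utf-8", errors="replace"))
            except LookupError:
                # Unknown charset label such as "unknown-8bit": fall back to UTF-8
                pieces.append(part.decode("utf-8", errors="replace"))
        else:
            pieces.append(part)
    return "".join(pieces)


def add_link(links_info, sender, subject, href):
    """Append a record unless the same link is already stored."""
    if any(info["link"] == href for info in links_info):
        return False
    links_info.append({"from": sender, "subject": subject, "link": href})
    return True
```

The dictionary keys from, subject, and link match the ones used by the script, so records built this way can be passed to the HTML generation step unchanged.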

Generating the HTML file

With the list of unsubscribe link information in hand, the script generates an HTML file from a template located in templates/template.html.

The HTML template contains a table with columns for the sender, subject, and unsubscribe link, and a {table_rows} placeholder where the rows belong. The script builds one row for each dictionary in the list and substitutes the rows for the placeholder.

The generated HTML file is saved in unsubscribe_links.html.
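A minimal sketch of the rendering step follows, assuming a template that contains the {table_rows} placeholder. The function names render_rows and render_page are hypothetical. Values are passed through html.escape so that a sender such as "Name <address>" is displayed literally instead of being parsed as a tag.

```python
from html import escape

ROW_TEMPLATE = (
    "<tr><td>{sender}</td><td>{subject}</td>"
    '<td><a href="{link}">Click here to unsubscribe</a></td></tr>'
)


def render_rows(links_info):
    """Build one escaped table row per stored link record."""
    rows = []
    for info in links_info:
        rows.append(
            ROW_TEMPLATE.format(
                sender=escape(info["from"]),
                subject=escape(info["subject"]),
                link=escape(info["link"], quote=True),
            )
        )
    return "".join(rows)


def render_page(template, links_info):
    """Substitute the rendered rows for the {table_rows} placeholder."""
    return template.format(table_rows=render_rows(links_info))
```

Since str.format treats every brace as a placeholder, a template that contains literal braces (for example inline CSS) would need them doubled as {{ and }}.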

Usage

Before running the script, define the environment variables in the .env file and create a keywords.txt file with the keywords to look for in the text of the email links.

The unsubscribe_links.html file will be generated in the same folder as the script, containing the table with the unsubscribe links found in the emails.

Steps to execute the script:

  1. Rename the .env.sample file to .env:

mv .env.sample .env

  2. Fill in the .env file with the necessary Outlook credentials and environment variables;

Example .env file:

EMAIL_ADDRESS=johndoe@outlook.com
EMAIL_PASSWORD=mypassword
IMAP_SERVER=outlook.office365.com
IMAP_SSL=True
UNSEEN=True
FOLDERS_EMAIL=Inbox,Sent Items
KEYWORDS_FILE=keywords.txt
  3. Create a keywords.txt file in the root of the project with the keywords to be searched for in the text of the email links;

Example keywords.txt file:

unsubscribe
opt-out
cancel subscription
unsubscribe from this list
update your preferences
  4. Install the project dependencies using the command pip install -r requirements.txt in the terminal;

  5. Save the content below in a file called script.py:

import imapclient
import email
import os
from dotenv import load_dotenv
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from email.header import decode_header

# Load the environment variables from the .env file
load_dotenv()

# Read the Outlook credentials from the environment
email_address = os.getenv('EMAIL_ADDRESS')
email_password = os.getenv('EMAIL_PASSWORD')

# Read the keywords from the file named by KEYWORDS_FILE, skipping blank lines
# (an empty keyword would match every link)
keywords_file = os.getenv('KEYWORDS_FILE', 'keywords.txt')
with open(keywords_file, 'r', encoding='utf-8') as file:
    keywords = [line.strip().lower() for line in file if line.strip()]

# Search only unread emails when UNSEEN is "True"; otherwise search all emails
unseen_only = os.getenv('UNSEEN')
if unseen_only is not None and unseen_only.strip().lower() == 'true':
    search_criteria = ['UNSEEN']
else:
    search_criteria = ['ALL']

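# Connect to the Outlook mailbox
imap_server = os.getenv('IMAP_SERVER')
# os.getenv returns a string, so convert IMAP_SSL to a real boolean
# (the string "False" would otherwise be treated as true)
imap_ssl = os.getenv('IMAP_SSL', 'True').strip().lower() == 'true'
client = imapclient.IMAPClient(imap_server, ssl=imap_ssl)
client.login(email_address, email_password)
# Split the folder list and trim whitespace around each name
folders = [name.strip() for name in os.getenv('FOLDERS_EMAIL', '').split(',') if name.strip()]
links_info = []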

for folder in folders:
    # Select the current folder in read-only mode
    client.select_folder(folder, readonly=True)

    # Search for emails that match the chosen criteria
    messages = client.search(search_criteria)

    # Iterate over the emails
    for msg_id in messages:
        msg_data = client.fetch(msg_id, ['RFC822'])
        msg = email.message_from_bytes(msg_data[msg_id][b'RFC822'])

        # Find the HTML body. Reset it for every message so that a message
        # without an HTML part does not reuse the previous message's content
        html_content = None
        if msg.is_multipart():
            for part in msg.walk():
                if part.get_content_type() == 'text/html':
                    html_content = part.get_payload(decode=True)
                    break
        elif msg.get_content_type() == 'text/html':
            html_content = msg.get_payload(decode=True)

        # Skip emails that have no HTML content
        if not html_content:
            continue

        # Extract the links using BeautifulSoup
        soup = BeautifulSoup(html_content, 'html.parser')
        for link in soup.find_all('a', href=True):
            href = link['href']
            text = link.text.lower()

            # Check whether any of the keywords appears in the link text
            if any(keyword in text for keyword in keywords):
                # Skip links that were already collected
                if href not in [info['link'] for info in links_info]:
                    # Decode the "From" header (empty string if the header is missing)
                    decoded_from = decode_header(msg['From'] or '')
                    from_email = ''.join([str(part, encoding or 'utf-8') if isinstance(part, bytes) else part for part, encoding in decoded_from])

                    # Decode the "Subject" header, dropping parts labeled "unknown-8bit"
                    decoded_subject = decode_header(msg['Subject'] or '')
                    subject = ''.join([str(part, encoding or 'utf-8') if isinstance(part, bytes) else part for part, encoding in decoded_subject if encoding != 'unknown-8bit'])

                    # Store the information in a dictionary
                    link_info = {'from': from_email, 'subject': subject, 'link': href}
                    links_info.append(link_info)

# Close the connection to the mail server
client.logout()

# Generate the HTML content
from html import escape  # standard library; escapes text placed into the HTML

template_path = os.path.join(os.path.dirname(__file__), 'templates', 'template.html')

with open(template_path, 'r', encoding='utf-8') as f:
    html_template = f.read()

# Build the table rows, escaping the values so that characters such as "<" in
# "Name <address>" are displayed instead of being parsed as HTML tags
table_rows = ''
for info in links_info:
    row = f'<tr><td>{escape(info["from"])}</td><td>{escape(info["subject"])}</td><td><a href="{escape(info["link"])}">Click here to unsubscribe</a></td></tr>'
    table_rows += row

# Combine the HTML template and the table rows
html_content = html_template.format(table_rows=table_rows)

# Save the HTML content to a file
with open('unsubscribe_links.html', 'w', encoding='utf-8') as file:
    file.write(html_content)

print('HTML file generated: unsubscribe_links.html')
  6. Run the script using the command python script.py in the terminal;

  7. Check that the unsubscribe_links.html file was successfully generated in the root of the project.