Develop a Spam Email Filter Using Machine Learning 📧🤖

Rajil TL

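Spam emails are annoying at best and dangerous or fraudulent at worst. To protect users, email providers run machine-learning spam filters that automatically detect and block unwanted messages. This tutorial builds a small filter of that kind with TF-IDF features and a Naïve Bayes classifier.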

How Does Spam Email Filtering Work? 🧐📩

A spam filter assigns every incoming message to one of two classes:

  • Ham (Legitimate Email) – Important, useful messages.
  • Spam (Unwanted Email) – Promotional, phishing, or malicious emails.

🔹 Techniques Used in Spam Detection

  • 📌 Keyword-Based Filtering – Detects words like “lottery”, “free money”, “urgent”.
  • 📌 Machine Learning (ML) Models – Learn from past emails to classify new ones.
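  • 📌 Bayesian Filtering – Uses Bayes' theorem to estimate the probability that an email is spam from the words it contains.
  • 📌 Deep Learning (LSTMs, Transformers) – Sequence models that learn from word order and context rather than from isolated word counts.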
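
To make the Bayesian idea concrete, here is a minimal sketch of Bayes' theorem applied to a single word. The message counts are made-up illustrative numbers, and `spam_probability_given_word` is a hypothetical helper, not part of any library.

```python
def spam_probability_given_word(spam_with_word, spam_total, ham_with_word, ham_total):
    """Return P(spam | word) using Bayes' theorem on simple message counts."""
    p_spam = spam_total / (spam_total + ham_total)    # prior P(spam)
    p_ham = 1.0 - p_spam                              # prior P(ham)
    p_word_given_spam = spam_with_word / spam_total   # likelihood P(word | spam)
    p_word_given_ham = ham_with_word / ham_total      # likelihood P(word | ham)
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / evidence

# Assumed counts: "lottery" appears in 40 of 200 spam and 5 of 800 ham messages.
p = spam_probability_given_word(40, 200, 5, 800)
print(f"P(spam | 'lottery') = {p:.3f}")
```

A Naïve Bayes classifier combines this kind of evidence across all words in a message, treating the words as independent given the class.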

Install Required Libraries 📦

pip install numpy pandas scikit-learn nltk

Import Libraries & Load Data 🗂️

# Data handling and text cleaning
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

# Feature extraction, model, and evaluation
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report

🔹 Load the Dataset

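# Note: the file name indicates the SMS Spam Collection, so the messages are short texts rather than full emails.
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms-spam-collection.csv"
# Assumes a headerless, comma-separated file with two columns; pass sep='\t' if the file is tab-separated.
df = pd.read_csv(url, encoding='latin-1', names=['label', 'message'])
# Encode the labels as integers: ham -> 0, spam -> 1.
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()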

Preprocess the Text Data 📝🔍

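# Download the stop-word list once; later runs reuse the local copy.
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()

def preprocess_text(text):
    """Lowercase, strip punctuation, remove stop words, and stem the remaining words."""
    text = text.lower()
    # Replace every character that is not a letter, digit, or underscore with a space.
    text = re.sub(r'\W', ' ', text)
    words = text.split()
    # Drop common words such as "the" and reduce the rest to their stems.
    words = [ps.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)

df['clean_message'] = df['message'].apply(preprocess_text)
df.head()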
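
To see what the cleaning steps do to a single message without downloading anything, here is a dependency-free sketch. It skips stemming, and `DEMO_STOP_WORDS` and `normalize` are hypothetical stand-ins for the NLTK stop-word list and `preprocess_text`.

```python
import re

# Tiny stand-in stop-word list (assumption); NLTK's English list is much longer.
DEMO_STOP_WORDS = {"a", "an", "the", "you", "have", "is", "to"}

def normalize(text):
    """Lowercase, replace non-word characters with spaces, and drop stop words."""
    text = text.lower()
    text = re.sub(r"\W", " ", text)  # keeps letters, digits, and underscores
    return [word for word in text.split() if word not in DEMO_STOP_WORDS]

tokens = normalize("Congratulations! You have WON a $1000 gift card.")
print(tokens)
```

Note that digits survive the regex, so an amount like "1000" stays available to the classifier as a feature.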

Convert Text Data into Numerical Features 🔢

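tfidf_vectorizer = TfidfVectorizer(max_features=5000)  # keep the 5,000 most frequent terms
# The sparse output works with MultinomialNB directly, so no dense .toarray() copy is needed.
# Caveat: fitting before the split lets test-set term statistics leak into the features.
X = tfidf_vectorizer.fit_transform(df['clean_message'])
y = df['label']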
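
As a rough illustration of what TF-IDF measures, the sketch below computes the textbook formula by hand on three made-up messages. This is an assumption-laden simplification: scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row by default, so its numbers will differ. `tf_idf` is a hypothetical helper.

```python
import math
from collections import Counter

docs = [
    "free prize claim now",
    "meeting at noon",
    "claim your free gift",
]
tokenized = [doc.split() for doc in docs]

def tf_idf(term, doc_tokens, all_docs):
    """Textbook TF-IDF: term frequency times log(N / document frequency)."""
    tf = Counter(doc_tokens)[term] / len(doc_tokens)
    doc_freq = sum(1 for tokens in all_docs if term in tokens)
    idf = math.log(len(all_docs) / doc_freq) if doc_freq else 0.0
    return tf * idf

# "free" occurs in two documents, "prize" in only one, so "prize" scores higher.
score_free = tf_idf("free", tokenized[0], tokenized)
score_prize = tf_idf("prize", tokenized[0], tokenized)
print(f"free: {score_free:.3f}, prize: {score_prize:.3f}")
```

Terms that appear in few messages get a larger weight, which is why rare but telling words carry more signal than common ones.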

Split Data for Training & Testing 🎯

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
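
One way to keep test messages out of the feature-fitting step is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline. The sketch below assumes a small made-up dataset in place of `df['clean_message']` and `df['label']`, and chains TF-IDF with the same `MultinomialNB` classifier this tutorial uses; `spam_pipeline` is an illustrative name.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (assumed) standing in for the cleaned messages and their labels.
texts = [
    "free prize claim now", "win free prize today", "free prize waiting inside",
    "urgent free prize offer", "claim free prize fast",
    "lunch at noon tomorrow", "see you after class", "call me when home",
    "project notes attached here", "running late sorry",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

# Split the raw text first, so the test messages stay unseen by the vectorizer.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the classifier.
spam_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
spam_pipeline.fit(train_texts, y_train)

predictions = spam_pipeline.predict(["free prize", "see you at lunch"])
print(predictions)
```

A pipeline also means new messages go through exactly the same transformation as the training data with a single `predict` call.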

Train the Spam Classifier 🚀📩

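model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
# Accuracy alone can mislead when one class dominates, so also report per-class precision and recall.
print(classification_report(y_test, y_pred, target_names=['ham', 'spam']))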
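
To show why accuracy by itself is a weak measure for spam filtering, here is a small sketch with made-up labels. It evaluates a useless "classifier" that marks every message as ham; `precision_recall` is a hypothetical helper written for this illustration.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Compute precision and recall for the positive (spam) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Assumed toy labels: 90 ham (0) and 10 spam (1).
y_true = [0] * 90 + [1] * 10
# A useless "classifier" that labels everything as ham.
y_all_ham = [0] * 100

accuracy = sum(t == p for t, p in zip(y_true, y_all_ham)) / len(y_true)
precision, recall = precision_recall(y_true, y_all_ham)
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}")
```

The accuracy looks respectable even though not a single spam message is caught, which is why recall and precision on the spam class are worth checking.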

Test the Spam Filter on New Emails 📬

def predict_spam(email_text):
    """Classify one raw message as "Spam" or "Not Spam"."""
    # Apply the same cleaning used for the training data.
    processed_email = preprocess_text(email_text)
    # Use transform (not fit_transform) so the training vocabulary is reused.
    email_vector = tfidf_vectorizer.transform([processed_email])
    prediction = model.predict(email_vector)[0]
    return "Spam" if prediction == 1 else "Not Spam"

test_email = "Congratulations! You have won a $1000 Walmart gift card. Claim now!"
print(f"Email: {test_email} -> Prediction: {predict_spam(test_email)}")
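
A hard Spam/Not Spam label hides how confident the model is. The sketch below, trained on a few made-up messages, uses `predict_proba` to expose a spam score and applies an adjustable cutoff. The `threshold` value of 0.8 and the helper names `spam_score` and `is_spam` are assumptions for illustration, and Naïve Bayes probabilities are often poorly calibrated, so any cutoff should be tuned on held-out data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy training data (assumed), standing in for the fitted vectorizer and model.
train_texts = [
    "free prize claim now", "win free prize today", "urgent free prize offer",
    "lunch at noon tomorrow", "see you after class", "project notes attached here",
]
train_labels = [1, 1, 1, 0, 0, 0]

vectorizer = TfidfVectorizer()
classifier = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

def spam_score(text):
    """Return the model's estimated probability that `text` is spam."""
    features = vectorizer.transform([text])
    spam_column = list(classifier.classes_).index(1)  # column of the spam class
    return classifier.predict_proba(features)[0][spam_column]

def is_spam(text, threshold=0.8):
    """Flag a message only when the spam probability reaches the threshold."""
    return spam_score(text) >= threshold

for message in ["free prize", "see you at lunch"]:
    print(f"{message!r}: score={spam_score(message):.2f}, spam={is_spam(message)}")
```

Raising the threshold trades missed spam for fewer legitimate messages wrongly blocked, which is usually the costlier mistake for an inbox.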

Save & Load the Model for Future Use 💾

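import joblib
# Save both artifacts: the model is only usable with the vectorizer it was trained with.
joblib.dump(model, "spam_classifier.pkl")
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.pkl")

# Later (for example, in another script), restore them before making predictions.
loaded_model = joblib.load("spam_classifier.pkl")
loaded_vectorizer = joblib.load("tfidf_vectorizer.pkl")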
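
As a quick sanity check, the following sketch saves and reloads a toy model in a temporary directory and confirms that the restored pair gives the same prediction as the original. The training messages are made up, and the toy `model` and `vectorizer` stand in for the trained objects from the tutorial.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy model (assumed), standing in for the trained model and vectorizer.
texts = ["free prize claim now", "win free prize today",
         "lunch at noon tomorrow", "see you after class"]
labels = [1, 1, 0, 0]
vectorizer = TfidfVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "spam_classifier.pkl")
    vectorizer_path = os.path.join(tmp_dir, "tfidf_vectorizer.pkl")
    joblib.dump(model, model_path)
    joblib.dump(vectorizer, vectorizer_path)

    # Reload both artifacts while the temporary files still exist.
    loaded_model = joblib.load(model_path)
    loaded_vectorizer = joblib.load(vectorizer_path)

sample = ["claim your free prize"]
original = model.predict(vectorizer.transform(sample))
restored = loaded_model.predict(loaded_vectorizer.transform(sample))
print(original, restored)
```

Pickled models should only be loaded from trusted sources, and ideally with the same scikit-learn version that saved them.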

Improve the Spam Classifier 🚀

  • ✅ Try deep learning models (LSTMs, Transformers), which can capture word order and context and may improve accuracy on larger datasets.
  • ✅ Expand the training data with real-world spam emails.
  • ✅ Apply additional NLP techniques, such as lemmatization instead of stemming.
  • ✅ Combine multiple models (ensemble learning) to improve classification.
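
The ensemble idea can be sketched with scikit-learn's `VotingClassifier`. The example below is one possible setup, not a recommended configuration: it assumes a made-up toy dataset and pairs Naïve Bayes with logistic regression, a second model chosen here purely for illustration.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (assumed) standing in for the cleaned messages and their labels.
texts = [
    "free prize claim now", "win free prize today", "free prize waiting inside",
    "urgent free prize offer", "claim free prize fast",
    "lunch at noon tomorrow", "see you after class", "call me when home",
    "project notes attached here", "running late sorry",
]
labels = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("nb", MultinomialNB()),
            ("lr", LogisticRegression(max_iter=1000)),
        ],
        voting="soft",  # average the predicted class probabilities
    ),
)
ensemble.fit(texts, labels)

ensemble_predictions = ensemble.predict(["free prize", "see you at lunch"])
print(ensemble_predictions)
```

Whether an ensemble actually beats the single Naïve Bayes model depends on the data, so it is worth comparing both on the same held-out test set.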

Real-World Applications of Spam Detection 🌍

  • 📧 Email Security – Protects users from phishing attacks.
  • 📱 SMS Spam Filtering – Identifies spam texts on mobile phones.
  • 🔒 Cybersecurity – Detects fraudulent messages and scam attempts.
  • 🤖 Chatbot Moderation – Blocks inappropriate or harmful messages.

Conclusion 🎯🏆

We successfully built a machine learning-based spam filter using:

  • ✅ Natural Language Processing (NLP) for text preprocessing.
  • ✅ TF-IDF for feature extraction.
  • ✅ A Naïve Bayes classifier to label messages as spam or ham.

A model like this can automate a first pass of spam filtering, helping keep inboxes clean and users safer. 🚀

🔹 Next Step: Try deploying this model into a real-time email filtering system!
