Spam emails are annoying at best and dangerous or outright fraudulent at worst. To protect users, email providers use AI-powered spam filters that automatically detect and block unwanted emails.
- How Spam Email Filtering Works 🧐📩
- Install Required Libraries 📦
- Import Libraries & Load Data 🗂️
- Preprocess the Text Data 📝🔍
- Convert Text Data into Numerical Features 🔢
- Split Data for Training & Testing 🎯
- Train the Spam Classifier 🚀📩
- Test the Spam Filter on New Emails 📬
- Save & Load the Model for Future Use 💾
- Improve the Spam Classifier 🚀
- Real-World Applications of Spam Detection 🌍
- Conclusion 🎯🏆
How Spam Email Filtering Works 🧐📩
A spam filter sorts each incoming email into one of two classes:
- ✅ Ham (Legitimate Email) – Important, useful messages.
- ❌ Spam (Unwanted Email) – Promotional, phishing, or malicious emails.
🔹 Techniques Used in Spam Detection
- 📌 Keyword-Based Filtering – Detects words like “lottery”, “free money”, “urgent”.
- 📌 Machine Learning (ML) Models – Learn from past emails to classify new ones.
- 📌 Bayesian Filtering – Calculates the probability of an email being spam.
- 📌 Deep Learning (LSTMs, Transformers) – Neural networks that model word order and context, not just word frequencies.
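To make the Bayesian idea concrete, here is a minimal from-scratch sketch of how a word-based spam probability could be computed. The toy messages and the helper names (`word_counts`, `log_likelihood`, `spam_probability`) are made up for illustration, and add-one smoothing is an assumption chosen to keep the math simple. It is not how any particular email provider does it.

```python
import math
from collections import Counter

# Toy training data (hypothetical, for illustration only).
ham = ["see you at lunch", "meeting moved to monday", "call me later"]
spam = ["win free money now", "free lottery win", "urgent claim your free prize"]

def word_counts(messages):
    """Count how often each word appears across a list of messages."""
    counts = Counter()
    for message in messages:
        counts.update(message.split())
    return counts

ham_counts, spam_counts = word_counts(ham), word_counts(spam)
vocabulary = set(ham_counts) | set(spam_counts)

def log_likelihood(words, counts):
    """Sum of log P(word | class) with add-one (Laplace) smoothing."""
    total = sum(counts.values())
    return sum(
        math.log((counts[w] + 1) / (total + len(vocabulary)))
        for w in words
        if w in vocabulary  # ignore words never seen in training
    )

def spam_probability(message):
    """Posterior P(spam | message) under the naive independence assumption."""
    words = message.lower().split()
    prior_spam = len(spam) / (len(spam) + len(ham))
    log_spam = math.log(prior_spam) + log_likelihood(words, spam_counts)
    log_ham = math.log(1 - prior_spam) + log_likelihood(words, ham_counts)
    # Normalize the two joint scores into a probability.
    return 1 / (1 + math.exp(log_ham - log_spam))

print(f"{spam_probability('free money'):.2f}")
print(f"{spam_probability('see you monday'):.2f}")
```

Messages built from words seen mostly in spam score above 0.5, and messages built from ham words score below it. The `MultinomialNB` classifier used later in this tutorial applies the same principle to TF-IDF features.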
Install Required Libraries 📦
pip install numpy pandas scikit-learn nltk
Import Libraries & Load Data 🗂️
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, classification_report
🔹 Load the Dataset
url = "https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms-spam-collection.csv"
df = pd.read_csv(url, encoding='latin-1', names=['label', 'message'])
df['label'] = df['label'].map({'ham': 0, 'spam': 1})
df.head()
Preprocess the Text Data 📝🔍
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
ps = PorterStemmer()
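def preprocess_text(text):
    # Lowercase so "Free" and "free" count as the same word.
    text = text.lower()
    # Replace every non-word character (punctuation, symbols) with a space.
    text = re.sub(r'\W', ' ', text)
    words = text.split()
    # Drop stop words and reduce the remaining words to their stems.
    words = [ps.stem(word) for word in words if word not in stop_words]
    return ' '.join(words)

df['clean_message'] = df['message'].apply(preprocess_text)
df.head()
Convert Text Data into Numerical Features 🔢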
tfidf_vectorizer = TfidfVectorizer(max_features=5000)
X = tfidf_vectorizer.fit_transform(df['clean_message']).toarray()
y = df['label']
Split Data for Training & Testing 🎯
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
Train the Spam Classifier 🚀📩
model = MultinomialNB()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy * 100:.2f}%")
Test the Spam Filter on New Emails 📬
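Before trying individual emails, it is worth looking beyond the single accuracy figure. Spam datasets are usually imbalanced, so a model can score high accuracy while still missing a lot of spam. The sketch below uses made-up labels (`y_true_demo`, `y_pred_demo`) in place of the tutorial's `y_test` and `y_pred` to show what a per-class report reveals.

```python
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical labels standing in for y_test and y_pred (0 = ham, 1 = spam).
y_true_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision, recall, and F1 per class. Accuracy here is 90%, although
# only half of the spam messages are caught.
print(classification_report(y_true_demo, y_pred_demo, target_names=["ham", "spam"]))
```

With the tutorial's variables, the equivalent call would be `classification_report(y_test, y_pred, target_names=["ham", "spam"])`, which also puts the already imported `classification_report` to use.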
def predict_spam(email_text):
    # Apply the same cleaning and vectorization used during training.
    processed_email = preprocess_text(email_text)
    email_vector = tfidf_vectorizer.transform([processed_email]).toarray()
    prediction = model.predict(email_vector)[0]
    return "Spam" if prediction == 1 else "Not Spam"

test_email = "Congratulations! You have won a $1000 Walmart gift card. Claim now!"
print(f"Email: {test_email} -> Prediction: {predict_spam(test_email)}")
Save & Load the Model for Future Use 💾
import joblib
joblib.dump(model, "spam_classifier.pkl")
joblib.dump(tfidf_vectorizer, "tfidf_vectorizer.pkl")
loaded_model = joblib.load("spam_classifier.pkl")
loaded_vectorizer = joblib.load("tfidf_vectorizer.pkl")
Improve the Spam Classifier 🚀
The Naïve Bayes model is a solid baseline. It can be improved by:
- ✅ Using Deep Learning (LSTMs, Transformers) for better accuracy.
- ✅ Expanding Training Data by using real-world spam emails.
- ✅ Applying Additional NLP Techniques like lemmatization instead of stemming.
- ✅ Combining Multiple Models (Ensemble Learning) to improve classification.
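As one possible take on the ensemble idea, the sketch below combines Naïve Bayes with logistic regression using soft voting. The toy `texts` and `labels` are made up, and the choice of logistic regression as the second model is an assumption for illustration. In the tutorial, the inputs would be `df['clean_message']` and `df['label']`.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data used only for illustration.
texts = [
    "win free prize now", "claim free money today", "urgent lottery winner claim",
    "free gift card click now", "lunch at noon tomorrow", "see you at the meeting",
    "can you call me later", "notes from class attached",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]  # 1 = spam, 0 = ham

# Soft voting averages the predicted probabilities of both classifiers.
ensemble = VotingClassifier(
    estimators=[("nb", MultinomialNB()), ("lr", LogisticRegression(max_iter=1000))],
    voting="soft",
)

# With the vectorizer inside the pipeline, it is fit only on the data passed to fit().
ensemble_pipeline = make_pipeline(TfidfVectorizer(), ensemble)
ensemble_pipeline.fit(texts, labels)

predictions = ensemble_pipeline.predict(["free prize claim now", "see you at lunch"])
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline also means that, when fitting on `X_train` only, the TF-IDF vocabulary is never learned from test messages. Whether the ensemble actually beats plain Naïve Bayes on the real dataset would need to be measured.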
Real-World Applications of Spam Detection 🌍
- 📧 Email Security – Protects users from phishing attacks.
- 📱 SMS Spam Filtering – Identifies spam texts on mobile phones.
- 🔒 Cybersecurity – Detects fraudulent messages and scam attempts.
- 🤖 Chatbot Moderation – Blocks inappropriate or harmful messages.
Conclusion 🎯🏆
We successfully built a machine learning-based spam filter using:
- ✅ Natural Language Processing (NLP) for text preprocessing.
- ✅ TF-IDF for feature extraction.
- ✅ Naïve Bayes classifier for accurate spam detection.
This AI-powered model automates spam filtering, keeping inboxes clean and users safe! 🚀
🔹 Next Step: Try deploying this model into a real-time email filtering system!
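As a starting point for deployment, here is a rough sketch of a single `classify` function backed by a saved model. It trains a tiny stand-in pipeline on made-up data, saves it to a temporary directory, and loads it back. The file name `spam_pipeline.pkl` and the function name `classify` are hypothetical, and a real system would plug in the model trained above along with the same `preprocess_text` step.

```python
import tempfile
from pathlib import Path

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in for the trained tutorial model.
texts = ["win free prize now", "claim free money today",
         "lunch at noon tomorrow", "see you at the meeting"]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB()).fit(texts, labels)

# Persist vectorizer and classifier together so they cannot get out of sync.
model_path = Path(tempfile.mkdtemp()) / "spam_pipeline.pkl"
joblib.dump(pipeline, model_path)

def classify(message, path=model_path):
    """Load the saved pipeline and label a single message.

    Loading on every call keeps the sketch short; a long-running
    service would load the pipeline once at startup instead.
    """
    loaded = joblib.load(path)
    return "Spam" if loaded.predict([message])[0] == 1 else "Not Spam"

print(classify("claim your free prize"))
```

Only load pickle files from sources you trust, since unpickling can execute arbitrary code.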


