Building FraudShield AI: When Your Data Is Already Balanced But You SMOTE Anyway

You know you’ve been in the ML game too long when you automatically reach for SMOTE like it’s a magic wand, only to realize your data was already perfectly balanced. It’s like bringing a fire truck to a candle—impressive, but completely unnecessary.

The SMOTE Incident: A Cautionary Tale

I was so deep in “fraud detection mode” that I forgot the first rule of ML: look at your data first. My thought process went something like:

  1. “Fraud detection? That’s imbalanced data!”
  2. “Imbalanced data? That means SMOTE!”
  3. “SMOTE? Let me write a whole data generation pipeline!”

```python
# What I wrote (with confidence)
import numpy as np
from imblearn.over_sampling import SMOTE

def enhance_dataset(X, y):
    # Because obviously fraud data is always imbalanced, right?
    smote = SMOTE(random_state=42)
    X_resampled, y_resampled = smote.fit_resample(X, y)
    return X_resampled, y_resampled

# What I should have done first
print(f"Original distribution: {np.bincount(y)}")
# Output: [5000, 5000] ... wait what?
```

The facepalm moment: When my “enhanced” SMOTE dataset started producing models that were suspiciously good at detecting… synthetic fraud patterns. The model was basically learning to recognize SMOTE’s handwriting instead of actual fraud signals.
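With hindsight, a tiny guard would have caught the whole mess before SMOTE ever ran. Here's a minimal sketch (the 1.5 imbalance threshold is my own arbitrary choice, not a magic number):

```python
import numpy as np

def needs_resampling(y, max_ratio=1.5):
    """Return True only if the majority/minority class ratio exceeds max_ratio."""
    counts = np.bincount(y)
    counts = counts[counts > 0]  # ignore labels that never appear
    return counts.max() / counts.min() > max_ratio

# Balanced data like mine: [5000, 5000] -> leave it alone
y_balanced = np.array([0] * 5000 + [1] * 5000)

# Genuinely skewed data: [9900, 100] -> now resampling is worth discussing
y_skewed = np.array([0] * 9900 + [1] * 100)
```

Two lines of sanity checking versus a whole synthetic data pipeline. Choose wisely.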

The Data Directory Discovery

The real comedy started when I found the data/ directory I’d completely forgotten about. It was like finding twenty dollars in your winter coat—surprising and mildly embarrassing.

Turns out I had:

  • data/original/ – The perfectly balanced dataset I’d carefully curated
  • data/synthetic/ – My SMOTE-generated “masterpiece”
  • data/why_did_i_make_this/ – Various experimental abominations

My model was training on Frankenstein data while the pristine original dataset sat there untouched, like fancy china that’s too nice to use.

The Performance Rollercoaster

With SMOTE:

  • Training accuracy: 99.8% (too good to be true? Nah…)
  • Real-world performance: “Why is it flagging every third transaction as fraud?”
  • Business logic: Completely overwhelmed by synthetic patterns

After SMOTE-ectomy:

  • Training accuracy: 98.2% (actually believable)
  • Real-world performance: “Wait, it’s actually working now?”
  • Business logic: Finally making sense again
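Accuracy alone hid most of the damage. A pure-NumPy sketch (with hypothetical numbers: 3% fraud in a production batch, and a model that flags every third transaction) shows how a passable-looking accuracy can coexist with catastrophic precision:

```python
import numpy as np

def report(y_true, y_pred):
    """Accuracy, precision, and recall from two 0/1 label arrays."""
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    accuracy = (tp + int(np.sum((y_true == 0) & (y_pred == 0)))) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical production batch: 3% fraud, model flags every third transaction
y_true = np.array([0] * 97 + [1] * 3)
y_pred = np.zeros(100, dtype=int)
y_pred[::3] = 1

acc, prec, rec = report(y_true, y_pred)
print(f"accuracy={acc:.2f} precision={prec:.3f} recall={rec:.2f}")
# accuracy=0.65 precision=0.029 recall=0.33
```

Roughly 97% of the flags are false alarms, which is exactly what "why is it flagging every third transaction?" feels like from the business side.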

The funniest part? My carefully engineered SMOTE pipeline was so integrated that removing it broke three different scripts that had become dependent on generating synthetic data that nobody needed.

Docker Networking: The Container Communication Crisis

While I was busy fighting synthetic data battles, my Docker containers were having their own communication issues. It was like building a brilliant fraud detection brain that couldn’t figure out how to make a phone call.

The Great Localhost Misunderstanding:

```python
# What I thought would work:
API_URL = "http://localhost:8000"

# What actually worked after two days of debugging:
API_URL = "http://backend:8000"  # Magic Docker DNS!

Turns out containers are like celebrities—they don’t use their real names backstage. localhost inside a container is like trying to call your own phone number and expecting to reach your friend.
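One way to stop hardcoding either hostname is to read the base URL from the environment, so the same code works on my laptop and inside Compose. A sketch (the `API_URL` variable name and its default are my own convention):

```python
import os

# Default to the Docker Compose service name; override for local runs, e.g.:
#   API_URL=http://localhost:8000 python app.py
API_URL = os.environ.get("API_URL", "http://backend:8000")
```

Compose then just sets the variable in the service's `environment:` block, and nothing in the code has to know where it is running.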

The Docker Compose Learning Curve

My docker-compose.yml went through more revisions than my resume:

```yaml
# Version 1: The Optimist
services:
  backend:
    build: .
    ports:
      - "8000:8000"
  # Wait, how do they talk to each other?
```

```yaml
# Version 5: The Realist
services:
  backend:
    build: ./backend
    ports:
      - "8000:8000"
    networks:
      - fraud-network
  frontend:
    build: ./frontend
    environment:
      - API_BASE=http://backend:8000  # The magic words!
    networks:
      - fraud-network

networks:
  fraud-network:
    driver: bridge
```

The moment I discovered Docker networks was like finding the secret menu at In-N-Out Burger—everything suddenly made sense.

The Health Check Comedy

My health checks evolved from optimistic to actually useful:

```python
# Phase 1: Naive optimism
@app.get("/health")
def health_check():
    return {"status": "I'm here!"}

# Phase 3: Paranoid engineer
@app.get("/health")
async def health_check():
    return {
        "status": "healthy" if await can_actually_predict() else "lying",
        "model_loaded": model is not None,
        "database_connected": await check_db(),
        "response_time": await measure_performance(),
        "last_prediction": get_last_successful_prediction(),
        "probably_other_things": "I don't trust anything anymore"
    }
```

The Sweet Victory

After surviving the SMOTE saga and Docker networking drama, that moment when everything finally worked felt like magic:

```bash
docker-compose up -d
# No errors? Wait, really?
curl http://localhost:8000/predict -X POST -H "Content-Type: application/json" \
  -d '{"transaction_amount": 5000, "account_balance": 100}'
# {"fraud_probability": 0.98, "risk_level": "HIGH"} - ACTUAL REAL RESULTS!
```

Want to Try It Yourself?

If my journey through SMOTE-induced chaos and Docker networking mysteries sounds familiar (or if you just want to see the final, actually-working version), you can check out the complete project:

🔗 GitHub Repository: FraudShield AI (https://github.com/Shodexco/fraud-detector)

The beauty of it now is that it actually works with that famous one-command deployment:

```bash
git clone https://github.com/Shodexco/fraud-detector.git
cd fraud-detector
docker-compose up -d
```

Then hit http://localhost:7860 for the web interface or http://localhost:8000/docs for the API documentation that actually documents a working system!

The moral of the story? Sometimes the fanciest techniques (looking at you, SMOTE) are solutions looking for problems, while the real magic is in the fundamentals: clean data, containers that actually talk to each other, and health checks that don’t lie to your face.

And if you’re wondering whether to use SMOTE on your next project—maybe just check your data distribution first. Your future self will thank you.
