STARGAZER AI

Dataset & Training Guide

ML PIPELINE GUIDE

Reference for preparing data, formatting CSVs, and training the classifier

Pipeline Overview

  1. Upload a labelled CSV containing the target column and feature columns.
  2. Preprocessing: handle missing values, encode categoricals, scale numeric features.
  3. Feature engineering: derive ratio, log, and compound features that help the model.
  4. Apply SMOTE to balance class distribution, then run 5-fold stratified cross-validation.
  5. Train XGBoost, evaluate metrics and confusion matrix, save model to disk.
  6. Upload unlabelled test data — the same preprocessing pipeline runs automatically.
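Steps 2–5 can be sketched end to end. This is a minimal illustration on synthetic data, not the app's actual code: `GradientBoostingClassifier` stands in for XGBoost so the sketch runs with scikit-learn alone, and the two `koi_` columns are placeholder features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic labelled data standing in for an uploaded CSV.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'koi_period': rng.lognormal(1, 1, 200),
    'koi_depth':  rng.lognormal(5, 1, 200),
    'koi_disposition': rng.choice(['CONFIRMED', 'FALSE POSITIVE'], 200),
})

X = df.drop(columns=['koi_disposition'])
y = df['koi_disposition']

X_scaled = StandardScaler().fit_transform(X)     # step 2: scale numerics
clf = GradientBoostingClassifier()               # step 5: XGBoost stand-in
scores = cross_val_score(                        # step 4: stratified 5-fold CV
    clf, X_scaled, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
print(round(scores.mean(), 2))
```

The labels here are random, so the cross-validated accuracy hovers near chance; with real KOI features the same scaffolding applies unchanged.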

Required & Recommended Features

The pipeline is designed around KOI (Kepler Object of Interest) naming conventions. Rename your columns if they differ.

koi_disposition ← target
koi_period
koi_time0bk
koi_impact
koi_duration
koi_depth
koi_ror
koi_srho
koi_prad
koi_sma
koi_incl
koi_num_transits
koi_steff
koi_slogg
koi_smet
koi_srad
koi_smass
koi_kepmag
koi_gmag / rmag / imag
koi_jmag / hmag / kmag
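Before uploading, it can help to verify which of these columns your CSV actually contains. A small check along these lines (the `REQUIRED` subset and the sample CSV are illustrative, not the app's validation logic):

```python
import io
import pandas as pd

# Illustrative subset of the columns listed above.
REQUIRED = ['koi_disposition', 'koi_period', 'koi_duration', 'koi_depth']

# Sample CSV that is missing two of the required columns.
csv_text = "koi_period,koi_depth\n3.5,120.0\n"
df = pd.read_csv(io.StringIO(csv_text))

missing = [c for c in REQUIRED if c not in df.columns]
print(missing)  # ['koi_disposition', 'koi_duration']
```

Any names reported as missing are candidates for the renaming step shown below.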

Column Renaming Example

If your CSV uses different column names, rename them before uploading:

import pandas as pd

df = pd.read_csv('my_data.csv')

rename_map = {
    'period':   'koi_period',
    'depth':    'koi_depth',
    'duration': 'koi_duration',
    'prad':     'koi_prad',
}
df.rename(columns=rename_map, inplace=True)
df.to_csv('my_data_renamed.csv', index=False)

Preprocessing Checklist

  • Remove duplicate rows and obviously corrupt records.
  • Ensure koi_disposition exists with consistent labels (e.g. CONFIRMED, CANDIDATE, FALSE POSITIVE).
  • Numeric columns: missing values → median imputation; outliers → IQR clipping (3×).
  • Categorical columns: missing values → mode or 'UNKNOWN'; encode with LabelEncoder.
  • Scale numeric features with StandardScaler before model training.
  • Derived features used by the app: planets_to_star_radius_ratio, log_period, depth_to_duration.
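The checklist above translates into a few lines of pandas/scikit-learn. This is a sketch on toy values (the column contents and the `flag` categorical are made up), showing median imputation, 3× IQR clipping, label encoding, two of the derived features, and scaling:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    'koi_period': [3.5, np.nan, 12.0, 400.0],
    'koi_prad':   [1.1, 2.0, np.nan, 9.5],
    'koi_srad':   [0.9, 1.1, 1.0, 2.0],
    'flag':       ['A', None, 'B', 'A'],   # hypothetical categorical column
})

# Numeric: median imputation, then clip to [Q1 - 3*IQR, Q3 + 3*IQR].
num = df.select_dtypes('number').columns
df[num] = df[num].fillna(df[num].median())
q1, q3 = df[num].quantile(0.25), df[num].quantile(0.75)
iqr = q3 - q1
df[num] = df[num].clip(q1 - 3 * iqr, q3 + 3 * iqr, axis=1)

# Categorical: fill with 'UNKNOWN', then label-encode.
df['flag'] = LabelEncoder().fit_transform(df['flag'].fillna('UNKNOWN'))

# Two of the derived features named in the checklist.
df['planets_to_star_radius_ratio'] = df['koi_prad'] / df['koi_srad']
df['log_period'] = np.log(df['koi_period'])

# Scale the original numeric features last.
df[num] = StandardScaler().fit_transform(df[num])
```

(`depth_to_duration` follows the same pattern: `koi_depth / koi_duration` after imputation.)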

Training Tips

  • Use stratified folds to preserve class distribution across splits.
  • Monitor accuracy, macro precision, recall, and F1 together — accuracy alone can be misleading with imbalanced classes.
  • SMOTE is applied before cross-validation, so synthetic samples can leak into validation folds and inflate the reported metrics; treat scores from small datasets with extra caution.
  • For XGBoost tuning: start with n_estimators=200, max_depth=6–10, learning_rate=0.05–0.1.
  • After training, the model is persisted to disk and automatically reloaded on the next server restart — no need to retrain unless your data changes.
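The leakage concern above can be avoided by resampling inside each fold rather than before the split. A sketch of that pattern, with scikit-learn's `resample` (plain random oversampling) standing in for SMOTE so the example needs no extra dependency:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Imbalanced synthetic data: ~90% class 0, ~10% class 1.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

scores = []
for tr, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    X_tr, y_tr = X[tr], y[tr]
    minority = y_tr == 1
    # Oversample the minority class on TRAINING data only.
    X_min, y_min = resample(X_tr[minority], y_tr[minority],
                            n_samples=int((~minority).sum()), random_state=0)
    X_bal = np.vstack([X_tr[~minority], X_min])
    y_bal = np.concatenate([y_tr[~minority], y_min])
    clf = GradientBoostingClassifier().fit(X_bal, y_bal)
    # Validation fold stays untouched, so the score is honest.
    scores.append(clf.score(X[va], y[va]))
print(round(float(np.mean(scores)), 2))
```

Swapping `resample` for `imblearn`'s `SMOTE` keeps the same structure: fit the sampler on the training fold only, never on the validation fold.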

CSV Formatting

  • Comma-separated, UTF-8 encoded, with a header row.
  • Use empty cells (not strings like NaN or null) for missing values.
  • Include koi_disposition for labelled uploads; omit or leave blank for test uploads.
  • Maximum file size: 16 MB.
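A quick way to catch the "missing values must be empty cells" rule before uploading is to scan for literal `NaN`/`null` strings. This helper is hypothetical (not part of the app), shown here on an inline sample CSV:

```python
import io
import pandas as pd

def bad_missing_value_columns(csv_text):
    """Return columns that spell missing values as 'NaN'/'null' strings
    instead of leaving the cell empty, as the formatting rules require."""
    # keep_default_na=False keeps these strings visible instead of
    # silently converting them to real NaN on read.
    df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
    mask = df.isin(['NaN', 'nan', 'null', 'NULL']).any()
    return list(mask[mask].index)

# koi_period uses a correct empty cell; koi_depth uses the string 'null'.
sample = "koi_period,koi_depth\n3.5,null\n,120.0\n"
print(bad_missing_value_columns(sample))  # ['koi_depth']
```

Columns it reports should have those strings replaced with empty cells before upload.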