STARGAZER AI

Dataset & Training Guide

ML PIPELINE GUIDE

Reference for preparing data, formatting CSVs, and training the classifier

Pipeline Overview

  1. Upload a labelled CSV containing the target column and feature columns.
  2. Preprocessing: handle missing values, encode categoricals, scale numeric features.
  3. Feature engineering: derive ratio, log, and compound features that help the model.
  4. Apply SMOTE to balance class distribution, then run 5-fold stratified cross-validation.
  5. Train XGBoost, evaluate metrics and confusion matrix, save model to disk.
  6. Upload unlabelled test data — the same preprocessing pipeline runs automatically.
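Steps 2–5 can be sketched end to end. This is a minimal illustration on synthetic data, not the app's actual code: `GradientBoostingClassifier` stands in for XGBoost so the sketch runs with scikit-learn alone, and the two `koi_` columns are placeholder features.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.preprocessing import StandardScaler

# Synthetic labelled data standing in for an uploaded CSV.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'koi_period': rng.lognormal(1, 1, 200),
    'koi_depth':  rng.lognormal(5, 1, 200),
    'koi_disposition': rng.choice(['CONFIRMED', 'FALSE POSITIVE'], 200),
})

X = df.drop(columns=['koi_disposition'])
y = df['koi_disposition']

X_scaled = StandardScaler().fit_transform(X)     # step 2: scale numerics
clf = GradientBoostingClassifier()               # step 5: XGBoost stand-in
scores = cross_val_score(                        # step 4: stratified 5-fold CV
    clf, X_scaled, y,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
print(round(scores.mean(), 2))
```

The labels here are random, so the cross-validated accuracy hovers near chance; with real KOI features the same scaffolding applies unchanged.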

Required & Recommended Features

The pipeline is designed around KOI (Kepler Object of Interest) naming conventions. Rename your columns if they differ.

koi_disposition ← target
koi_period
koi_time0bk
koi_impact
koi_duration
koi_depth
koi_ror
koi_srho
koi_prad
koi_sma
koi_incl
koi_num_transits
koi_steff
koi_slogg
koi_smet
koi_srad
koi_smass
koi_kepmag
koi_gmag / rmag / imag
koi_jmag / hmag / kmag
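Before uploading, it can help to verify which of these columns your CSV actually contains. A small check along these lines (the `REQUIRED` subset and the sample CSV are illustrative, not the app's validation logic):

```python
import io
import pandas as pd

# Illustrative subset of the columns listed above.
REQUIRED = ['koi_disposition', 'koi_period', 'koi_duration', 'koi_depth']

# Sample CSV that is missing two of the required columns.
csv_text = "koi_period,koi_depth\n3.5,120.0\n"
df = pd.read_csv(io.StringIO(csv_text))

missing = [c for c in REQUIRED if c not in df.columns]
print(missing)  # ['koi_disposition', 'koi_duration']
```

Any names reported as missing are candidates for the renaming step shown below.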

Column Renaming Example

If your CSV uses different column names, rename them before uploading:

import pandas as pd

df = pd.read_csv('my_data.csv')

rename_map = {
    'period':   'koi_period',
    'depth':    'koi_depth',
    'duration': 'koi_duration',
    'prad':     'koi_prad',
}
df.rename(columns=rename_map, inplace=True)
df.to_csv('my_data_renamed.csv', index=False)

Preprocessing Checklist

  • Remove duplicate rows and obviously corrupt records.
  • Ensure koi_disposition exists with consistent labels (e.g. CONFIRMED, CANDIDATE, FALSE POSITIVE).
  • Numeric columns: missing values → median imputation; outliers → IQR clipping (3×).
  • Categorical columns: missing values → mode or 'UNKNOWN'; encode with LabelEncoder.
  • Scale numeric features with StandardScaler before model training.
  • Derived features used by the app: planets_to_star_radius_ratio, log_period, depth_to_duration.
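The checklist above translates into a few lines of pandas/scikit-learn. This is a sketch on toy values (the column contents and the `flag` categorical are made up), showing median imputation, 3× IQR clipping, label encoding, two of the derived features, and scaling:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, StandardScaler

df = pd.DataFrame({
    'koi_period': [3.5, np.nan, 12.0, 400.0],
    'koi_prad':   [1.1, 2.0, np.nan, 9.5],
    'koi_srad':   [0.9, 1.1, 1.0, 2.0],
    'flag':       ['A', None, 'B', 'A'],   # hypothetical categorical column
})

# Numeric: median imputation, then clip to [Q1 - 3*IQR, Q3 + 3*IQR].
num = df.select_dtypes('number').columns
df[num] = df[num].fillna(df[num].median())
q1, q3 = df[num].quantile(0.25), df[num].quantile(0.75)
iqr = q3 - q1
df[num] = df[num].clip(q1 - 3 * iqr, q3 + 3 * iqr, axis=1)

# Categorical: fill with 'UNKNOWN', then label-encode.
df['flag'] = LabelEncoder().fit_transform(df['flag'].fillna('UNKNOWN'))

# Two of the derived features named in the checklist.
df['planets_to_star_radius_ratio'] = df['koi_prad'] / df['koi_srad']
df['log_period'] = np.log(df['koi_period'])

# Scale the original numeric features last.
df[num] = StandardScaler().fit_transform(df[num])
```

(`depth_to_duration` follows the same pattern: `koi_depth / koi_duration` after imputation.)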

Training Tips

  • Use stratified folds to preserve class distribution across splits.
  • Monitor accuracy, macro precision, recall, and F1 together — accuracy alone can be misleading with imbalanced classes.
  • SMOTE is applied before cross-validation, so synthetic samples can leak into validation folds and inflate the reported metrics; treat scores from small datasets with extra caution.
  • For XGBoost tuning: start with n_estimators=200, max_depth=6–10, learning_rate=0.05–0.1.
  • After training, the model is persisted to disk and automatically reloaded on the next server restart — no need to retrain unless your data changes.
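The leakage concern above can be avoided by resampling inside each fold rather than before the split. A sketch of that pattern, with scikit-learn's `resample` (plain random oversampling) standing in for SMOTE so the example needs no extra dependency:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.utils import resample

# Imbalanced synthetic data: ~90% class 0, ~10% class 1.
X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

scores = []
for tr, va in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    X_tr, y_tr = X[tr], y[tr]
    minority = y_tr == 1
    # Oversample the minority class on TRAINING data only.
    X_min, y_min = resample(X_tr[minority], y_tr[minority],
                            n_samples=int((~minority).sum()), random_state=0)
    X_bal = np.vstack([X_tr[~minority], X_min])
    y_bal = np.concatenate([y_tr[~minority], y_min])
    clf = GradientBoostingClassifier().fit(X_bal, y_bal)
    # Validation fold stays untouched, so the score is honest.
    scores.append(clf.score(X[va], y[va]))
print(round(float(np.mean(scores)), 2))
```

Swapping `resample` for `imblearn`'s `SMOTE` keeps the same structure: fit the sampler on the training fold only, never on the validation fold.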

CSV Formatting

  • Comma-separated, UTF-8 encoded, with a header row.
  • Use empty cells (not strings like NaN or null) for missing values.
  • Include koi_disposition for labelled uploads; omit or leave blank for test uploads.
  • Maximum file size: 16 MB.
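A quick way to catch the "missing values must be empty cells" rule before uploading is to scan for literal `NaN`/`null` strings. This helper is hypothetical (not part of the app), shown here on an inline sample CSV:

```python
import io
import pandas as pd

def bad_missing_value_columns(csv_text):
    """Return columns that spell missing values as 'NaN'/'null' strings
    instead of leaving the cell empty, as the formatting rules require."""
    # keep_default_na=False keeps these strings visible instead of
    # silently converting them to real NaN on read.
    df = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)
    mask = df.isin(['NaN', 'nan', 'null', 'NULL']).any()
    return list(mask[mask].index)

# koi_period uses a correct empty cell; koi_depth uses the string 'null'.
sample = "koi_period,koi_depth\n3.5,null\n,120.0\n"
print(bad_missing_value_columns(sample))  # ['koi_depth']
```

Columns it reports should have those strings replaced with empty cells before upload.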