Introduction

Scikit-learn is one of the most popular machine learning libraries for Python. It is built on top of NumPy, SciPy, and Matplotlib, which makes it an efficient and user-friendly toolkit for data analysis, predictive modeling, and AI-driven applications.

Key Features of Scikit-learn:

  • Simple and efficient tools for data mining and analysis.
  • Built-in algorithms for classification, regression, clustering and more.
  • Support for preprocessing tasks like feature selection, normalization and dimensionality reduction.
  • Extensive documentation and active community to help developers and data scientists.
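As a rough illustration of the features listed above, the sketch below runs one estimator from each family (classification, regression, clustering, dimensionality reduction) on small synthetic data. The data, the target rules, and the choice of estimators are assumptions made for this example only; the point is that each estimator follows the same fit/predict or fit/transform pattern.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (illustrative assumption): 60 samples, 3 features
rng = np.random.default_rng(0)
X = rng.random((60, 3))

# Classification: predict a binary label derived from the first feature
y_class = (X[:, 0] > 0.5).astype(int)
clf = DecisionTreeClassifier(random_state=0).fit(X, y_class)
class_pred = clf.predict(X)

# Regression: predict a continuous target that is linear in two features
y_reg = 2.0 * X[:, 0] + 3.0 * X[:, 1]
reg = LinearRegression().fit(X, y_reg)
reg_pred = reg.predict(X)

# Clustering: group the samples into two clusters without using labels
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = kmeans.fit_predict(X)

# Dimensionality reduction: project the 3 features onto 2 principal components
X_2d = PCA(n_components=2).fit_transform(X)

print(class_pred.shape, reg_pred.shape, cluster_labels.shape, X_2d.shape)
```

The predictions here are made on the same data the models were fitted on, which is fine for showing the API but says nothing about how well a model generalizes.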

Code

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Step 1: Generate Sample Data
np.random.seed(42)
X = np.random.rand(100, 2)  # 100 samples, 2 features
y = (X[:, 0] + X[:, 1] > 1).astype(int)  # Labels based on sum of features

# Step 2: Split the Data into Training and Testing Sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Step 3: Standardize the Features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Step 4: Train the Logistic Regression Model
model = LogisticRegression()
model.fit(X_train_scaled, y_train)

# Step 5: Make Predictions
y_pred = model.predict(X_test_scaled)

# Step 6: Evaluate the Model
accuracy = accuracy_score(y_test, y_pred)
print(f"Model Accuracy: {accuracy:.2f}")

Explanation:

  1. Data Generation: We create random data points and define labels based on a simple rule.
  2. Splitting the Dataset: The dataset is divided into training (80%) and testing (20%) parts.
  3. Feature Scaling: StandardScaler rescales each feature to zero mean and unit variance. It is fitted on the training set only and then applied to the test set, so no information from the test data leaks into training.
  4. Model Training: We use logistic regression, a popular algorithm for binary classification.
  5. Prediction: After training, the model predicts labels for the test data.
  6. Evaluation: We measure how well the model performs using the accuracy score, the fraction of test labels predicted correctly.
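
The scaling and training steps above can also be chained into a single object. The sketch below is one possible variant, not part of the original walkthrough: it uses make_pipeline so the scaler is always fitted on the training data only, and it adds a confusion matrix alongside the accuracy score. The data rule mirrors the one used earlier, but the random generator differs, so the exact numbers will not match.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Same labeling rule as in the walkthrough: label is 1 when the features sum to more than 1
rng = np.random.default_rng(42)
X = rng.random((100, 2))
y = (X[:, 0] + X[:, 1] > 1).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The pipeline fits the scaler and the model together on the training data
pipeline = make_pipeline(StandardScaler(), LogisticRegression())
pipeline.fit(X_train, y_train)

# At prediction time the pipeline applies the fitted scaler, then the model
y_pred = pipeline.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])  # rows: true, columns: predicted

print(f"Model Accuracy: {accuracy:.2f}")
print(cm)
```

Keeping preprocessing inside the pipeline means there is no separate scaled copy of the data to track, and the same object can be passed to tools such as cross-validation.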