Journal 43
May 27 - June 2
I had been looking forward to this week's machine learning topic. We focused on data preprocessing, KNN classification, train/test splits, cross-validation, and methods for evaluating machine learning models. One of the biggest lessons I learned was that data cleaning is one of the most important steps in a data science project. Before this week, I tended to think of machine learning as mostly being about building models, but the lectures and homework showed that poor-quality data can cause problems long before a model is trained. I learned how to identify missing values using functions such as isna(), how to count missing values by row and column, and how to decide whether missing data should be removed or imputed. I also learned that missing data can sometimes be disguised as unusual values such as zeros or special strings, which made the campaign contribution and diabetes homework assignments particularly interesting and interactive.
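To make the missing-value checks above concrete for myself, here is a minimal sketch, assuming pandas and a small made-up table. The column names, the "N/A" placeholder, and the choice of median imputation are my own illustrations, not the actual homework data or the required approach.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; 0.0 and "N/A" stand in for disguised missing values.
df = pd.DataFrame({
    "glucose": [148, 85, np.nan, 89],
    "bmi": [33.6, 0.0, 23.3, 28.1],
    "donor_state": ["CA", "N/A", "NY", None],
})

# Count missing values per column and per row.
missing_per_column = df.isna().sum()
missing_per_row = df.isna().sum(axis=1)

# Recode disguised missing values as real NaN so isna() can see them.
cleaned = df.copy()
cleaned["bmi"] = cleaned["bmi"].replace(0.0, np.nan)
cleaned["donor_state"] = cleaned["donor_state"].replace("N/A", np.nan)

# Option 1: drop every row that still has a missing value.
dropped = cleaned.dropna()

# Option 2: impute numeric columns with their median (text columns are left alone).
imputed = cleaned.fillna(cleaned.median(numeric_only=True))
```

In this toy table, dropping rows keeps only one of the four records, which is a small reminder of why imputing can sometimes be the better choice.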
One topic I found challenging at first was deciding how to handle missing values. The lectures showed me there is not always a single correct answer. Sometimes removing rows is appropriate, while in other situations it is better to replace missing values with the mean, median, mode, or another estimate. I also found it very interesting that some values that appear valid may actually represent missing data. For example, in the diabetes data set, values of zero for variables such as BMI and blood pressure were suspicious and likely represented missing information rather than actual measurements. This reinforced the idea that data preprocessing requires critical thinking and an understanding of the context of the data, not just technical skills. Another important concept this week was feature scaling. I learned about unit interval scaling and z-score normalization, and I saw why scaling is especially important for distance-based algorithms such as KNN. Before scaling, variables with larger numeric ranges can dominate distance calculations, leading to poor predictions. After scaling, all predictors contribute more evenly to the model. The visualization of scaled data helped me understand how outliers can affect distributions and why scaling can be an essential preprocessing step.
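The scaling point above can be sketched with a few made-up records. This is only an illustration, assuming NumPy; the age and income values and the helper nearest_to_first are hypothetical and not from the lectures.

```python
import numpy as np

# Hypothetical records: columns are age (years) and income (dollars).
X = np.array([
    [25.0, 50_000.0],   # row 0: the record we want a neighbor for
    [27.0, 52_000.0],   # row 1: similar age, income differs by 2,000
    [65.0, 50_500.0],   # row 2: very different age, income differs by 500
    [45.0, 90_000.0],   # row 3: widens the income range
])

# Unit interval (min-max) scaling: each column is mapped onto [0, 1].
unit = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score normalization: each column gets mean 0 and standard deviation 1.
z = (X - X.mean(axis=0)) / X.std(axis=0)


def nearest_to_first(data):
    """Index of the row closest (Euclidean distance) to row 0, excluding row 0."""
    distances = np.linalg.norm(data[1:] - data[0], axis=1)
    return int(np.argmin(distances)) + 1


raw_neighbor = nearest_to_first(X)      # income dominates the raw distance
unit_neighbor = nearest_to_first(unit)  # both columns now contribute
z_neighbor = nearest_to_first(z)
```

On the raw values the nearest neighbor of row 0 is row 2, because the income gap swamps the 40-year age gap. After either kind of scaling the nearest neighbor becomes row 1, which is the record that is actually similar on both variables.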
The machine learning portion of the week, which I had been looking forward to, focused on K-Nearest Neighbors classification. I learned how KNN makes predictions by looking at the closest training examples and allowing them to vote on the predicted class. I also learned how the value of k affects model performance. Small values of k can lead to overfitting, while large values of k can make the model too simple and cause it to underfit. Cross-validation was especially helpful because it provided a systematic way to choose a good value of k without relying on a single train/test split. I found it useful to see how accuracy changed as k changed and how cross-validation could help estimate performance on future unseen data. I also learned that accuracy alone does not always tell the whole story. Through confusion matrices, precision, recall, and F1 scores, I learned that two models with similar accuracy can have very different strengths and weaknesses. This was very clear in the heart disease examples from the lecture. In some situations, minimizing false negatives may be more important than maximizing overall accuracy. This gave me a better understanding of why data scientists often use multiple evaluation metrics instead of relying on a single number.
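Here is a rough sketch of how choosing k with cross-validation and then checking more than accuracy might look, assuming scikit-learn. The breast cancer data set bundled with scikit-learn is only a stand-in for the heart disease data from the lecture, and the range of k values, the 5 folds, and the 25% test split are my own assumptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data set shipped with scikit-learn (not the lecture's heart disease data).
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Score each candidate k with 5-fold cross-validation on the training set only.
# Scaling sits inside the pipeline so it is refit on the training part of each fold.
cv_accuracy = {}
for k in range(1, 22, 2):
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=k))
    cv_accuracy[k] = cross_val_score(model, X_train, y_train, cv=5).mean()

# Pick the k with the highest mean cross-validated accuracy.
best_k = max(cv_accuracy, key=cv_accuracy.get)

# Refit with the chosen k and evaluate once on the held-out test set.
final_model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=best_k))
final_model.fit(X_train, y_train)
y_pred = final_model.predict(X_test)

# Confusion matrix counts; precision, recall, and F1 treat label 1 as the positive class.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
```

Looking at fn next to the accuracy is what makes the false-negative concern visible: two values of k can score almost the same in cv_accuracy and still miss different numbers of positive cases.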
These topics are directly related to my future career goal, and each week I try to connect what I learn to that goal. Baseball data sets often contain missing information, unusual values, and variables measured on very different scales. Understanding how to clean and preprocess data will help ensure that player evaluation models and predictive analytics are based on reliable information. KNN and model evaluation techniques could be used to identify players with similar performance profiles, classify prospects, or predict player outcomes. Cross-validation will also be valuable because it helps estimate how well a model will perform on future seasons rather than just on historical data. This week helped me understand that successful machine learning depends not only on building models, but also on preparing data correctly and evaluating models carefully.