Journal 42

 May 20 - May 26

This course is absolutely flying by. After a bit of research and practice, I am finally getting the hang of Google Colab, and I am thoroughly enjoying creating and editing the various graphs. This week's work focused on probability distributions, relationships between variables, and data visualization. I learned a lot about the difference between joint, marginal, and conditional probabilities. At first, some of the notation was confusing, such as deciding when to use P(A|B) versus P(A and B). However, after working through the notes and homework, I started to understand that conditional probability is really about narrowing the dataset to a specific group first and then measuring probabilities within that group. The crosstab examples using normalize='index' and normalize='columns' helped clarify this idea because I could see how the probabilities changed depending on which variable was being conditioned on. We also spent time learning how to visualize relationships between categorical variables. I learned how grouped bar plots, stacked bar plots, and heatmaps can all represent similar information but emphasize different parts of the data. One thing I found interesting was the importance of choosing the right variable to condition on. The notes gave an example of weather and vehicle damage, and that made the idea much clearer. If we think weather influences damage, then we should visualize P(damage|weather), not the other way around.
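To make the joint, marginal, and conditional distinction concrete for myself, here is a minimal sketch using pd.crosstab. The weather and damage values below are toy data I made up for illustration, not the dataset from the course notes, and the column names are my own assumptions.

```python
import pandas as pd

# Hypothetical toy data (not the course dataset): one row per incident.
df = pd.DataFrame({
    "weather": ["rain", "rain", "rain", "clear", "clear", "clear", "clear", "snow"],
    "damage":  ["minor", "major", "major", "minor", "minor", "minor", "major", "major"],
})

# Joint probabilities P(weather and damage): each cell divided by the grand total.
joint = pd.crosstab(df["weather"], df["damage"], normalize="all")

# Conditional probabilities P(damage | weather): each row sums to 1.
damage_given_weather = pd.crosstab(df["weather"], df["damage"], normalize="index")

# Conditional probabilities P(weather | damage): each column sums to 1.
weather_given_damage = pd.crosstab(df["weather"], df["damage"], normalize="columns")

# Marginal probabilities P(weather): share of rows for each weather value.
weather_marginal = df["weather"].value_counts(normalize=True)

# Conditional = joint / marginal, e.g. P(major | rain) = P(rain and major) / P(rain).
p_major_given_rain = joint.loc["rain", "major"] / weather_marginal["rain"]

print(damage_given_weather)
print(p_major_given_rain)
```

In this toy table, normalize='index' answers "given the weather, how likely is each level of damage?", while normalize='columns' answers the reverse question, which is why the two tables contain different numbers even though they come from the same counts.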

Another important part of this week's learning was the campaign contribution homework assignment. This lab helped me connect the statistical ideas to actual exploratory data analysis. I enjoyed recreating the plots showing relationships between contribution amounts, occupations, employment status, and political candidates. I like the homework assignments where we have to match sample output because it is fun and interesting to tweak the plots and see how different code changes different parts of the graphs. The third homework assignment, open_viz, was also very helpful because I learned that there is often more than one correct visualization choice, and we have to sort through several "correct" answers to find the best fit. I had originally used a boxplot to compare hours worked across work classes because it contained more information. However, I saw that the plot looked crowded and overwhelming, so I switched to a median bar plot. After comparing the two, I realized simpler plots can sometimes communicate information more clearly than more complicated ones. The boxplot technically had more detail, but because there were so many outliers it became cluttered and difficult to read. The median bar plot ended up being cleaner and easier to interpret. This was a useful lesson for me because in data science, readability and communication are just as important as statistical correctness.
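The boxplot versus median bar plot comparison can be sketched roughly like this. This is not my actual homework code: the data is a small made-up sample, and the column names workclass and hours_per_week are assumptions standing in for whatever the real dataset uses.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical sample (not the homework data); column names are assumptions.
df = pd.DataFrame({
    "workclass": ["Private"] * 6 + ["Self-employed"] * 6 + ["Government"] * 6,
    "hours_per_week": [40, 40, 38, 45, 20, 80,
                       50, 60, 35, 55, 10, 99,
                       40, 40, 37, 42, 40, 60],
})

# One summary number per group: the median hours worked.
medians = df.groupby("workclass")["hours_per_week"].median().sort_values()

fig, (ax_box, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Left: boxplot shows the full spread, including every outlier.
df.boxplot(column="hours_per_week", by="workclass", ax=ax_box)
ax_box.set_title("Boxplot: spread and outliers")
ax_box.set_xlabel("")
ax_box.set_ylabel("Hours per week")
fig.suptitle("")  # remove the automatic "Boxplot grouped by ..." title

# Right: median bar plot shows only the typical value for each group.
medians.plot.bar(ax=ax_bar)
ax_bar.set_title("Median bar plot: one value per group")
ax_bar.set_xlabel("")
ax_bar.set_ylabel("Median hours per week")

fig.tight_layout()
```

In Colab the figure appears when the cell finishes running. With only a handful of points the boxplot is still readable; the clutter I ran into showed up once there were many outliers in each group.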

I also got a lot of practice using grouping and aggregation operations in pandas, especially groupby(), median(), value_counts(), and crosstab(). I'm getting much more comfortable thinking about how to summarize data in different ways depending on the question being asked. In particular, I noticed how median values can sometimes describe skewed data better than means, especially in datasets with large outliers like campaign contributions. One thing I still want to improve on is choosing the best type of visualization automatically. Sometimes I can create multiple plots that are technically valid, but I am still learning which visualization communicates the relationship most effectively. This week felt much more like real data science than earlier weeks because we are not just calculating statistics; we are trying to interpret relationships, make decisions about visualizations, and explain the patterns in the data. I can already see how these skills would apply directly to sports analytics, especially in baseball, where analysts constantly compare conditional probabilities and visualize relationships between variables like pitch type, count, hitter tendencies, and outcomes. I am excited to see what the next week of lessons and labs holds.
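Here is a small sketch of the mean versus median point using groupby(). The contribution amounts and candidate labels are invented for illustration and are not taken from the homework dataset.

```python
import pandas as pd

# Hypothetical contributions (made-up numbers); candidate "A" has one large outlier.
contribs = pd.DataFrame({
    "candidate": ["A", "A", "A", "A", "B", "B", "B", "B"],
    "amount":    [25, 50, 50, 2700, 20, 40, 60, 80],
})

# Summarize each candidate's contributions three ways in one groupby call.
summary = contribs.groupby("candidate")["amount"].agg(["mean", "median", "count"])

# How many contributions each candidate received.
counts = contribs["candidate"].value_counts()

print(summary)
```

Both candidates have a median of 50, but the single 2700 contribution pulls candidate A's mean up to 706.25 while candidate B's mean stays at 50. That gap is why the median can feel like a more honest "typical" value when the data is skewed.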
