Journal 41
May 13 - May 19
This week the lectures and labs focused mostly on continuous variables, probability distributions, and visualization techniques. It was an interesting and genuinely fun week, because it felt like the point where statistics, mathematics, and programming started coming together in a much more practical way. As someone with a degree in math, I enjoyed seeing how theory from my math courses translates into actual exploratory data analysis workflows in data science. One of the biggest ideas this week was how to visualize the distribution of data. We worked with density plots, histograms, and box plots, and I learned which aspects of the data each visualization emphasizes and when each one is the better choice. Density plots were especially interesting to me because they connect directly to concepts from probability theory and continuous distributions. In math, I learned about probability density functions abstractly, but seeing density estimation used on real datasets made the idea feel much more concrete. Histograms also reinforced the importance of bin selection: poorly chosen bins can unintentionally distort how the data is interpreted. I had to be precise in the labs, particularly in the college data lab, because I was matching my output to the given samples.
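A minimal sketch of the bin-selection point above, using NumPy on synthetic data. The two-cluster dataset, the seed, and the bin counts are assumptions chosen for illustration; this is not the college data from the lab.

```python
import numpy as np

# Synthetic data with two clusters (an assumption for illustration only).
rng = np.random.default_rng(0)
data = np.concatenate([
    rng.normal(loc=20, scale=2, size=500),
    rng.normal(loc=30, scale=2, size=500),
])

# Same data, two different bin choices.
coarse_counts, coarse_edges = np.histogram(data, bins=2)
fine_counts, fine_edges = np.histogram(data, bins=30)

# With only 2 bins each cluster lands in its own bin, so the gap
# between the clusters is invisible; 30 bins reveal both peaks.
print("2 bins: ", coarse_counts)
print("30 bins:", fine_counts)

# A data-driven alternative: the Freedman-Diaconis rule picks the
# bin edges from the spread and size of the sample.
fd_edges = np.histogram_bin_edges(data, bins="fd")
print("Freedman-Diaconis bins:", len(fd_edges) - 1)

# The quartiles a box plot would draw for the same data.
q1, median, q3 = np.percentile(data, [25, 50, 75])
print("Q1, median, Q3:", q1, median, q3)
```

Printing the counts instead of plotting keeps the sketch free of plotting libraries; passing the same `bins` values to a plotting function would show the same effect visually.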
I also found the discussion of skewness and log transformations very valuable. The examples involving heavily right-skewed variables made me think about how common this type of distribution probably is in real-world data. For example, I thought about baseball analytics, where variables like salary, home run totals, or social media engagement likely have strong positive skew because a small number of players are extreme outliers. It was interesting to see how a simple logarithmic transformation can make a dataset much easier to interpret visually. I also enjoyed the relationship between the normal distribution and real data. Since I have a math background, I was already familiar with Gaussian distributions and standard deviation, but I appreciated learning how these ideas are actually implemented in Python using SciPy. Creating distributions via stats.norm() and generating random samples using rvs() made the concepts feel more computational and practical. I also found it interesting that random samples drawn from the same theoretical distribution can still produce very different density plots. This reinforced the idea that datasets are only samples from underlying populations, a concept that can feel obvious mathematically but becomes much more meaningful when visualized.
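The two ideas above can be sketched with SciPy as follows. The distribution parameters, sample sizes, and seeds are assumptions for illustration, and the log-normal sample is only a stand-in for a right-skewed variable such as salary, not real baseball data.

```python
import numpy as np
from scipy import stats

# A frozen standard normal distribution; rvs() draws random samples from it.
dist = stats.norm(loc=0, scale=1)
sample_a = dist.rvs(size=50, random_state=1)
sample_b = dist.rvs(size=50, random_state=2)

# Two samples from the same distribution still differ in their
# summary statistics, and so would their density plots.
print("sample A mean/std:", sample_a.mean(), sample_a.std(ddof=1))
print("sample B mean/std:", sample_b.mean(), sample_b.std(ddof=1))

# A right-skewed sample (log-normal, chosen as an assumption).
skewed = stats.lognorm(s=1.0).rvs(size=5000, random_state=0)
logged = np.log(skewed)

# Sample skewness before and after the log transformation: strongly
# positive before, close to zero after, because the logarithm of a
# log-normal variable is normally distributed.
skew_before = stats.skew(skewed)
skew_after = stats.skew(logged)
print("skewness before log:", skew_before)
print("skewness after log: ", skew_after)
```

Fixing `random_state` makes the draws reproducible; leaving it out gives a fresh sample on every run, which is an easy way to see how much density plots vary from sample to sample.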
This week's homework made me much more comfortable working in Google Colab, since I am new to notebook-style workflows. I can see now why notebooks are so commonly used in data science. Looking ahead to my career goals, this week's material felt very relevant. Visualization is critical in baseball because analysts need to communicate findings clearly to coaches, scouts, and front office staff who may not have technical backgrounds. A well-designed visualization can make patterns obvious immediately, while a poor one can make good analysis difficult to understand. I can imagine using density plots to compare pitch velocities, histograms to study launch angle distributions, or box plots to compare player consistency across seasons.