Journal 40
May 6 - May 12
This is one of the few courses I have taken where I can honestly say I am excited to sit down and start learning each week. This week's lectures and videos helped me better understand how important data distributions and aggregation are in data science. At the start, many of the Pandas operations felt like separate tools I had to memorize, which would explain the Pandas flashcards we are given in the course modules. After working through the homework labs (and my gosh, did that take me a while), I began to see how they connect when analyzing datasets. I learned how to use Series and DataFrames, apply aggregation functions such as mean and median, group data with groupby(), and count categories with value_counts(). Together these summarize a dataset, including its categorical variables, and they definitely came in handy for the labs. I also got more practice with boolean masks and vectorized operations, which I prefer over writing loops. I think the biggest thing I learned this week was how data scientists work with entire columns of data at once, which saves time compared with processing one value at a time. This will absolutely be useful in my future work.
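A minimal sketch of how the Pandas operations mentioned above fit together. The table, its column names ("group", "value"), and its numbers are made up for illustration and are not the lab data.

```python
import pandas as pd

# Hypothetical toy data: one categorical column and one numeric column.
df = pd.DataFrame({
    "group": ["a", "b", "a", "b", "a"],
    "value": [12.0, 30.0, 8.0, 45.0, 10.0],
})

# Vectorized operation: the whole column is doubled at once, no loop.
doubled = df["value"] * 2

# Boolean mask: a True/False Series that selects the matching rows.
mask = df["value"] > 20
large = df[mask]

# value_counts() summarizes a categorical column by counting each category.
counts = df["group"].value_counts()

# groupby() plus aggregation: mean and median of "value" within each group.
stats = df.groupby("group")["value"].agg(["mean", "median"])
```

Each line works on an entire column, which is the main contrast with handling one value at a time in a loop.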
One thing I found particularly interesting was how often probability and distributions appeared in situations that I did not initially think of as statistics. The readings made me think about how much of machine learning is really about understanding variables and how they relate to each other. For Mother's Day, I took my grandma to the new Hard Rock casino and won some money playing craps, so the game was fresh on my mind this week, and I made an interesting connection between craps and probability distributions. While I was working through the readings, I started thinking more about how games of chance are built around statistical outcomes and expected values. This summer, while I'm on a family trip to Las Vegas, I want to experiment with using some of the data science concepts from this class to better understand betting probabilities and patterns in craps. Even if it does not change the outcomes of the game itself, I think it would be a fun way to apply concepts like probability, distributions, conditional probability, and data analysis to something outside of the classroom.
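As a rough sketch of the craps idea, the distribution of two fair dice can be enumerated exactly. The choice of bet is an assumption made here for illustration: a pass-line bet under the standard rules (7 or 11 wins on the come-out roll, 2, 3, or 12 loses, and any other total becomes the point, which must repeat before a 7), paying even money.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Distribution of the sum of two fair six-sided dice (36 equally likely rolls).
counts = Counter(a + b for a, b in product(range(1, 7), repeat=2))
pmf = {total: Fraction(n, 36) for total, n in counts.items()}

# Come-out roll: a 7 or 11 wins immediately.
p_win = pmf[7] + pmf[11]

# Otherwise 4, 5, 6, 8, 9, or 10 becomes the point. Only the point or a 7
# ends the round, so the conditional probability of hitting the point first
# is P(point) / (P(point) + P(7)).
for point in (4, 5, 6, 8, 9, 10):
    p_win += pmf[point] * pmf[point] / (pmf[point] + pmf[7])

# Expected profit per 1 unit staked on an even-money bet.
expected_value = p_win * 1 + (1 - p_win) * (-1)
```

Under these assumptions the win probability works out to 244/495 (about 49.3%), so the expected value is slightly negative. That fits the point above that the analysis does not change the outcomes of the game.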
The bike sharing lab was probably the most useful assignment for me this week because it felt the most like real-world data analysis, which is what I want to do in the future. I liked working with a large dataset and using aggregation to answer practical questions about rider behavior, age groups, routes, and user types. By using groupby() to compare trip lengths across demographics, I was able to see how data science can be used to identify trends that could inform predictions. Since my long-term goal is to work in MLB data science, I can imagine using very similar techniques to analyze player tendencies, pitch usage, or performance splits between different groups of players. Even something as simple as grouping data by age group or player type could be useful in baseball analytics.
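A small sketch of the kind of comparison described above, assuming hypothetical columns "age" and "trip_minutes" and invented rows. The actual lab dataset, its column names, and its age bins are not shown here.

```python
import pandas as pd

# Made-up rides; values are for illustration only.
rides = pd.DataFrame({
    "age": [22, 35, 41, 19, 58, 33],
    "trip_minutes": [15.0, 9.0, 12.0, 25.0, 7.0, 11.0],
})

# Bin ages into labeled groups: (17, 29], (29, 44], (44, 100].
rides["age_group"] = pd.cut(
    rides["age"],
    bins=[17, 29, 44, 100],
    labels=["18-29", "30-44", "45+"],
)

# Compare average trip length across age groups.
mean_by_age = rides.groupby("age_group", observed=True)["trip_minutes"].mean()
```

The same pattern would carry over to a baseball table by swapping in a grouping column such as player type and a numeric column to aggregate.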
One challenge I had this week was interpreting some of the probability distribution graphs, specifically estimating probabilities directly from PDFs and CDFs. I understand the general ideas, but it is sometimes difficult to estimate values visually from a graph, and I begin to doubt whether my estimate is reasonable. I had to remind myself that for continuous variables, the probability of a single exact value is always zero, so a probability has to come from the area under the PDF over an interval, or from the difference between two CDF values. Another thing I had to pay attention to was assignment formatting in the Pandas labs.
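One way to check a visual estimate is to compute the same probability numerically. The sketch below assumes a standard normal distribution purely as an example (the distributions in the course graphs may differ) and uses NormalDist from Python's standard library.

```python
from statistics import NormalDist

# Assumed example distribution: standard normal (mean 0, standard deviation 1).
dist = NormalDist(mu=0.0, sigma=1.0)

# Probability of an interval = difference between two CDF values.
p_within_one_sd = dist.cdf(1.0) - dist.cdf(-1.0)

# Probability of a single exact value: an interval of zero width.
p_exactly_zero = dist.cdf(0.0) - dist.cdf(0.0)

# The PDF height at a point is a density, not a probability.
density_at_zero = dist.pdf(0.0)
```

Here the interval probability is about 0.68, the single-point probability is exactly 0, and the density at 0 is about 0.40, which shows why the height of a PDF curve should not be read as a probability.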
This week felt like it moved me closer to thinking like a data scientist rather than just learning Python syntax, which I have come to really appreciate and enjoy. I am starting to see how distributions, aggregation, and grouping can be used to answer meaningful questions from data, and I think these concepts will be very important for the kind of sports analytics I want to do in the future. I am looking forward to next week.