Journal 39

April 29 – May 5

FINALLY the long-awaited data science class has begun! I am sure my peers in my cohort are tired of hearing me talk about it by now, but I have been eagerly awaiting this course, as I hope to go into the field of data science after graduation. This first week was my official, formal introduction to data science, though I have researched various parts of it previously. Before this class began, I had only limited experience coding in Python, and I had no prior experience with Spyder, Jupyter, or Google Colab. This was also my first introduction to NumPy, which deals with arrays and operations on large datasets. As discussed in the lecture slides, NumPy is ideal for storing and operating on numeric data. I learned about concepts such as slicing, boolean masks, and vectorized operations, as well as why they are such powerful tools for processing large amounts of data efficiently.
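To keep these three ideas straight for myself, here is a tiny sketch of each one on a small, made-up array (the numbers are just illustrative, not from the labs):

```python
import numpy as np

# A small array of made-up values for illustration
x = np.array([3, 8, 1, 9, 4, 7])

# Slicing: take elements at positions 1 through 3
middle = x[1:4]        # array([8, 1, 9])

# Boolean mask: True wherever the value exceeds 5,
# then use the mask to pull out just those values
mask = x > 5           # array([False, True, False, True, False, True])
big = x[mask]          # array([8, 9, 7])

# Vectorized operation: double every element without writing a loop
doubled = x * 2        # array([ 6, 16,  2, 18,  8, 14])
```

The vectorized version does in one line what would otherwise take an explicit loop over every element, which is exactly why it matters for large datasets.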

After completing the homework assignments, I can review the topics that stumped me. One area that confused me was boolean indexing in NumPy. For example, number 16 in the 1D lab was difficult to work through in the "think before you run it" portion, because my original prediction was not the IndexError the code actually produced. I originally believed that the mask would simply pick out the matching positions, but after running the code, I saw that a boolean mask needs to be the same length as the array it is indexing. Because x was length 10 and the mask was length 8, there was an error. Working through that error in judgment and really delving into the problem helped me better understand how NumPy arrays and masks interact. Also, in the 2D lab, I made a mistake in the calculation for the last problem, which asked for filtering colleges based on undergraduate enrollment. While this was more of a mathematical error / typo (I used > instead of <), it was a good reminder that one tiny mistake can entirely change the code. Since the last few classes contained little actual coding and were more about art, writing, and community service, it was a good refresher in being sure to test and double-check code. Furthermore, catching an error that small made me think more about the importance of precision in data analysis, because a mistake as simple as pressing the key next to the correct one can completely change the interpretation of results.
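Here is a minimal reconstruction of the mask-length mistake (not the exact lab code, just the same idea): a length-8 mask against a length-10 array fails, while a mask built from the array itself always matches:

```python
import numpy as np

x = np.arange(10)                    # length 10: [0, 1, ..., 9]
short_mask = np.array([True, False] * 4)   # length 8 -- too short!

try:
    x[short_mask]                    # mask length must match the array
except IndexError as e:
    print("IndexError:", e)

# A mask derived from x itself is automatically the right length
good_mask = x % 2 == 0               # length 10
evens = x[good_mask]                 # array([0, 2, 4, 6, 8])
```

Building the mask from a comparison on the array itself (like `x % 2 == 0`) is the usual pattern, since the lengths can never disagree that way.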

In addition, the homework really helped me understand how 2D arrays relate to datasets. In the slides, seeing syntax like X[:,2] felt confusing and abstract, but the homework questions helped me see how column manipulation is used in real projects. I think one of the most interesting concepts I learned this week was vectorized operations, because they allow for calculations across entire datasets without the need to write complex loops. This seems especially important in sports analytics, because teams analyze thousands of rows of player and game data.
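What finally made X[:,2] click for me is thinking of rows as records and columns as fields. A small sketch with made-up numbers:

```python
import numpy as np

# Rows are records (e.g. players), columns are fields (e.g. stats)
X = np.array([[1, 10, 100],
              [2, 20, 200],
              [3, 30, 300]])

col = X[:, 2]          # every row, column index 2 -> array([100, 200, 300])
row = X[1, :]          # row index 1, every column -> array([  2,  20, 200])
col_mean = X[:, 2].mean()   # average of one field across all records: 200.0
```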

My long-term goal is to work in data science or analytics for Major League Baseball, and this week's topics already feel directly connected to that goal. Baseball analytics focuses heavily on organizing, filtering, and analyzing large datasets. To illustrate, a team might use boolean masks to isolate pitchers above a certain strikeout rate, or vectorized operations to calculate averages and advanced statistics for every player at once. Learning NumPy and Python gives me a solid foundation to eventually work with more advanced tools used in MLB, such as pandas, machine learning libraries, and visualization software.
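Just for fun, the pitcher example could look something like this in NumPy (the names and strikeout rates are entirely made up by me, not real MLB data):

```python
import numpy as np

# Hypothetical strikeout rates (K%) for five pitchers -- made-up numbers
names = np.array(["Pitcher A", "Pitcher B", "Pitcher C", "Pitcher D", "Pitcher E"])
k_rates = np.array([0.31, 0.18, 0.27, 0.22, 0.35])

# Boolean mask: isolate pitchers with a strikeout rate above 25%
high_k = k_rates > 0.25
elite = names[high_k]        # ['Pitcher A', 'Pitcher C', 'Pitcher E']

# Vectorized operation: average across every pitcher in one call
avg_rate = k_rates.mean()    # 0.266
```

The same two tools from the labs this week, just pointed at baseball-shaped data.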

While learning this week, I found myself thinking a lot about how data analytics is becoming more and more important across professional sports. Recently, there has been a lot of discussion about how MLB teams are increasingly using analytics for player development, defensive positioning, and even injury prevention. Teams track HUGE amounts of data from systems like Statcast. While doing the homework labs this week, I could already begin to see how the ability to manipulate arrays and datasets efficiently would be absolutely necessary in an environment where analysts process vast amounts of information every day. Also, to tie this into our last class, having clean and effective code is absolutely necessary to prevent bias, as data analysis in MLB would likely decide drafts, player matchups, and playing time.

This week helped me to start developing a sort of data science mindset, wherein I'm thinking beyond just writing loops and algorithms, and starting to think about how to manipulate and analyze entire datasets in the most efficient way possible. Though some concepts were confusing at first, especially multidimensional indexing, taking detailed notes and working through mistakes in the homework helped to strengthen my understanding of these topics. I feel as though I now have a much clearer idea of how Python and NumPy will connect to real-world analytics work in the future.
