Posts

Topic Modeling... Digging Deeper

Natural Language Processing (NLP) allows machines to understand and process human language. Within this field, topic modeling stands out as a potent technique for uncovering hidden patterns and themes within large collections of text data. In this blog post, we will explore how topic modeling empowers NLP and enhances a wide range of applications.

Topic modeling algorithms, such as Latent Dirichlet Allocation (LDA) and Latent Semantic Analysis (LSA), offer a powerful means of understanding the structure of text data. By analyzing the co-occurrence patterns of words, these algorithms automatically extract latent topics, revealing the underlying themes within a corpus. This enables researchers and developers to gain valuable insights into the content and organization of vast amounts of text data.

One of the key advantages of topic modeling in NLP is its ability to cluster similar documents together. By ass...

How to Attack the Curse of Dimensionality

Compacting High-Feature Datasets

Intro

In the field of data science, the exponential growth of data has led to an increasing need to handle high-dimensional datasets efficiently. The curse of dimensionality poses challenges for analysis and modeling, making dimensionality reduction techniques crucial. This post surveys different approaches to reducing dimensionality, highlighting their strengths, weaknesses, and practical applications.

The dimensionality of a dataset refers to the number of features or variables present. High-dimensional data often suffer from sparsity, noise, and computational complexity, which can hinder data analysis and machine learning tasks. Dimensionality reduction methods aim to transform the original dataset into a lower-dimensional representation while preserving the most important information.

Dimensionality Reduction Methods

Feature selection techniques aim to identify the most relevant subset o...
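As a concrete example of projecting a dataset to a lower-dimensional representation while keeping most of its information, here is a minimal PCA sketch; it assumes scikit-learn and uses the built-in iris data, since the post's excerpt does not name a specific dataset or library:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 150 samples with 4 features each
X, _ = load_iris(return_X_y=True)

# Standardize first: PCA is sensitive to feature scale
X_scaled = StandardScaler().fit_transform(X)

# Project 4 features down to 2 principal components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)                              # (150, 2)
print(pca.explained_variance_ratio_.sum())     # fraction of variance retained
```

The `explained_variance_ratio_` attribute quantifies the trade-off: how much of the original variance survives the compression, which is the "preserving the most important information" the excerpt refers to.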

Movie Data Analysis

For our first project at the Flatiron School, we were tasked with giving Microsoft three recommendations on how to enter the movie industry. We used the Python package Pandas and SQL to load, clean, and analyze data from several datasets and to create visualizations, and we presented our findings and our three best recommendations to the company's C-suite. Before starting the project, we had to understand the business of the movie industry and the best way to enter the market. The market is saturated with big competitors like Warner Bros, Paramount Pictures, and Disney, so how does Microsoft navigate this field of giants and become a success? A good measure of success in any industry is return on investment and profit, so we started with a wide focus on movies and their profitability. We then got more granular in our analysis of the profitability of the top: movie genres, movie run ...
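The profitability analysis described above boils down to a few Pandas operations. Here is a minimal sketch with an invented three-row dataset (the project's actual data sources and column names are not shown in the excerpt, so these are placeholders):

```python
import pandas as pd

# Hypothetical movie data; the real project loaded and cleaned external datasets
movies = pd.DataFrame({
    "title": ["Movie A", "Movie B", "Movie C"],
    "budget": [100, 50, 200],            # in millions (illustrative)
    "worldwide_gross": [300, 40, 500],
})

# Profit and return on investment, the success measures the post describes
movies["profit"] = movies["worldwide_gross"] - movies["budget"]
movies["roi"] = movies["profit"] / movies["budget"]

# Rank movies by ROI, highest first
print(movies.sort_values("roi", ascending=False))
```

The same pattern, grouped by genre or runtime bucket instead of title, supports the more granular breakdowns the post goes on to describe.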

Why Data Science

Why Data Science?

As a kid and a teenager, I was very active. No matter the season, I was watching and playing a sport. Sports have been a major influence on my life, and for the longest time I wanted to be a college athlete. In high school, I focused solely on baseball, and my goal was to play at a Division 1 school. The constant pressure of performing in front of scouts my junior year changed my outlook on the sport I used to love to play. That same year, however, I was taking an AP Statistics course that I found fascinating. Even though my passion for playing the game had faded, I was now analyzing the sport through a different lens. After games, I would study the iPad that tracked our stats, poring over the advanced metrics and trends. And I wasn't only interested in baseball and my own performance; I loved watching basketball, football, and golf as well!

While watching these sports, on-screen graphic pop-ups would always dr...