The Most Influential Factor of IMDb Movie Rating - Part I: Data Scraping

I have always been an enthusiastic movie fan, and I like to discover great movies by browsing film rating and review websites such as IMDb and Rotten Tomatoes. When I look at their top-movie lists, I always wonder: what are the primary factors that influence a movie’s success? Is it budget, box office, language, or genre?

By chance, three other movie lovers (also data lovers) and I decided to conduct a statistical study of the factors that influence a movie’s success. We chose IMDb’s “Top 500 Greatest Movies of All Time” list, treated each movie as an instance, collected various quantities related to each movie, and then ran a statistical analysis on the resulting dataset. The project has two parts:

1. Use data scraping to extract data from an IMDb movie list and create a dataset.
2. Use descriptive statistics and multiple regression modeling to visualize and analyze the data we collected.

This article focuses on the first part of the project: dataset creation through data scraping and cleaning. The purpose of the dataset is to analyze the primary factors that influence a movie’s success, as measured by its rating.

Our team created this dataset by scraping, with Beautiful Soup, a page created by IMDb in 2017: “Top 500 Greatest Movies of All Time”. Movies are the only type of instance in the dataset, and there are 500 instances (movies) in total. The dataset contains all possible instances from the webpage.

Each instance consists of certificate, duration, rating, genre, vote, gross, country, language, and budget. The categorical variables (certificate, genre, country, and language) were collected as unprocessed text scraped directly from the webpage. The numerical variables (duration, rating, vote, gross, and budget) were collected as unprocessed text and values, also scraped directly from the webpage. There is no label or target associated with each instance.

Among our 500 instances, some values were missing, such as the gross and budget of certain movies. When we could find the missing information elsewhere, we added it to the CSV file (our dataset). When gross or budget information was not available anywhere, we removed the instance. Our original dataset also included a Metascore for each instance, but we deleted that variable in the end because many movies’ Metascores were missing. These data points were missing because they were not available on the IMDb website: some of the movies are very old, and data from sixty years ago is hard to collect.

We initially collected the data with no explicit links between individual instances, because we wanted to analyze the dataset objectively. There are some sources of noise in the dataset. For example, some foreign movies report gross and budget in foreign currencies, and we manually converted these amounts to USD at the current exchange rate.
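The cleaning steps described in this article (turning scraped text into numbers, dropping the Metascore variable, and removing movies without gross or budget) could look roughly like the sketch below. It is not our actual script: the sample rows, the text formats such as `142 min` or `$28.34M`, and the helper names `parse_money`, `parse_int`, and `clean` are all assumptions made for illustration.

```python
import re

# Hypothetical raw rows, shaped like text scraped from a movie list page.
# The titles and the field formats ("142 min", "$28.34M") are assumptions.
RAW_ROWS = [
    {"title": "Movie A", "duration": "142 min", "rating": "9.3",
     "vote": "2,345,678", "gross": "$28.34M", "budget": "$25,000,000",
     "metascore": "80"},
    {"title": "Movie B", "duration": "96 min", "rating": "8.1",
     "vote": "120,500", "gross": "", "budget": "$1,200,000",
     "metascore": ""},
]


def parse_money(text):
    """Convert '$28.34M' or '$25,000,000' to a float in USD; None if empty."""
    text = text.strip().replace("$", "").replace(",", "")
    if not text:
        return None
    multiplier = 1.0
    if text[-1] in "Mm":  # trailing "M" means millions
        multiplier = 1_000_000.0
        text = text[:-1]
    return float(text) * multiplier


def parse_int(text):
    """Extract the first integer from text such as '142 min' or '2,345,678'."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else None


def clean(rows):
    """Keep rows that have both gross and budget; leave out the Metascore."""
    cleaned = []
    for row in rows:
        gross = parse_money(row["gross"])
        budget = parse_money(row["budget"])
        if gross is None or budget is None:
            continue  # instance removed: gross or budget is unavailable
        cleaned.append({
            "title": row["title"],
            "duration": parse_int(row["duration"]),
            "rating": float(row["rating"]),
            "vote": parse_int(row["vote"]),
            "gross": gross,
            "budget": budget,
        })
    return cleaned


movies = clean(RAW_ROWS)
print(movies)
```

In this sketch the second row is dropped because its gross is empty, which mirrors how we removed instances whose gross or budget we could not find anywhere.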