[Review] Kaggle: App Store project

S부장 in US 2021. 11. 9. 22:44

 

 

store_data.head()

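 | id | track_name | size_bytes | currency | price | rating_count_tot | rating_count_ver | user_rating | user_rating_ver | ver | cont_rating | prime_genre | sup_devices.num | ipadSc_urls.num | lang.num | vpp_lic
1 | 281656475 | PAC-MAN Premium | 100788224 | USD | 3.99 | 21292 | 26 | 4.0 | 4.5 | 6.3.5 | 4+ | Games | 38 | 5 | 10 | 1
2 | 281796108 | Evernote - stay organized | 158578688 | USD | 0.00 | 161065 | 26 | 4.0 | 3.5 | 8.2.2 | 4+ | Productivity | 37 | 5 | 23 | 1
3 | 281940292 | WeatherBug - Local Weather, Radar, Maps, Alerts | 100524032 | USD | 0.00 | 188583 | 2822 | 3.5 | 4.5 | 5.0.0 | 4+ | Weather | 37 | 5 | 3 | 1
4 | 282614216 | eBay: Best App to Buy, Sell, Save! Online Shop... | 128512000 | USD | 0.00 | 262241 | 649 | 4.0 | 4.5 | 5.10.0 | 12+ | Shopping | 37 | 5 | 9 | 1
5 | 282935706 | Bible | 92774400 | USD | 0.00 | 985920 | 5320 | 4.5 | 5.0 | 7.5.1 | 4+ | Reference | 37 | 5 | 45 | 1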

 

 

Top 10 apps on the basis of total rating

The total rating count is a rough proxy for the number of downloads, since more ratings suggest more users. The dataset has no download figures, so we treat rating_count_tot as the target variable in their place.

 

# Sort by total rating count and keep the ten most-rated apps
store_data_sorted = store_data.sort_values('rating_count_tot', ascending=False)
subset_store_data_sorted = store_data_sorted[:10]
visualizer(subset_store_data_sorted.track_name, subset_store_data_sorted.rating_count_tot, "bar", "TOP 10 APPS ON THE BASIS OF TOTAL RATINGS", "APP NAME", "RATING COUNT (TOTAL)", True, -60)
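The visualizer helper is called throughout this post but its definition is not shown. The following is a minimal sketch of what such a helper might look like, inferred only from the call sites here; the parameter names (rotation, rotation_value, figsize), the use of plain matplotlib, and the headless Agg backend are all assumptions, not the notebook author's code.

```python
from collections import Counter

import matplotlib

matplotlib.use("Agg")  # assumption: headless backend so the sketch runs without a display
import matplotlib.pyplot as plt


def visualizer(x, y, plot_type, title, xlabel, ylabel,
               rotation=False, rotation_value=60, figsize=(15, 8)):
    """Hypothetical stand-in for the notebook's plotting helper."""
    x_vals = list(x)
    fig, ax = plt.subplots(figsize=figsize)
    if plot_type == "bar":
        y_vals = list(y)
        if y_vals and isinstance(y_vals[0], str):
            # Labels were passed as y (as in the price chart): draw horizontal bars.
            ax.barh(y_vals, x_vals)
        else:
            ax.bar(x_vals, y_vals)
    elif plot_type == "count":
        # A count plot shows how often each distinct value occurs.
        counts = Counter(x_vals)
        levels = sorted(counts)
        ax.bar([str(v) for v in levels], [counts[v] for v in levels])
    else:
        raise ValueError(f"unsupported plot_type: {plot_type!r}")
    ax.set_title(title, fontsize=20)
    ax.set_xlabel(xlabel)
    ax.set_ylabel(ylabel)
    if rotation:
        # Rotate x tick labels so long app names stay readable.
        ax.tick_params(axis="x", labelrotation=rotation_value)
    return ax
```

The sketch returns the Axes instead of calling plt.show(), so the caller can save or display the figure as needed.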

 

 

Top 10 apps on the basis of price

In this dataset, every App Store app is priced in USD.

 

store_data.currency.unique()

array(['USD'], dtype=object)

# Sort by price and keep the ten most expensive apps
store_data_price = store_data.sort_values('price', ascending=False)
subset_store_data_price = store_data_price[:10]  # slice added so the chart matches its "TOP 10" title
visualizer(subset_store_data_price.price, subset_store_data_price.track_name, "bar", "TOP 10 APPS ON THE BASIS OF PRICE", "Price (in USD)", "APP NAME")

 

 

Linear Correlation of Features

lang.num (the number of languages an app supports) shows the highest linear correlation with rating_count_tot (the total rating count).

 

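# numeric_only=True (pandas 1.5+) skips text columns such as track_name; pandas 2.x raises an error without it
corr_store_data = store_data.corr(numeric_only=True)
corr_store_data["rating_count_tot"].sort_values(ascending=False)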

 

# Heatmap of the full correlation matrix
plt.figure(figsize=(15, 15))
plt.title("CORRELATION OF FEATURES", fontsize=20)
sns.heatmap(corr_store_data)
plt.xticks(rotation=-60, fontsize=15)
plt.yticks(fontsize=15)
plt.show()
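DataFrame.corr uses the Pearson coefficient by default, which measures only linear association. As a rough illustration, the sketch below computes the coefficient by hand and checks it against pandas. The toy values and the column name lang_num are invented for the example and do not come from the dataset.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "lang_num": [1, 5, 10, 20, 30],
    "rating_count_tot": [100, 900, 1500, 4000, 5200],
})


def pearson_r(a, b):
    """Pearson r: covariance of a and b divided by the product of their spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()  # centre each variable on its mean
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))


manual = pearson_r(toy["lang_num"], toy["rating_count_tot"])
from_pandas = toy["lang_num"].corr(toy["rating_count_tot"])
print(manual, from_pandas)  # the two values should agree
```

A low Pearson value therefore does not rule out a non-linear relationship between a feature and the rating count.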

 

 

 

 

User Ratings on the App Store

Here is the count plot of user ratings on the Apple App Store. Note that a large number of apps are rated 0.

 

visualizer(store_data.user_rating, None, "count", "RATINGS ON APP STORE", "RATINGS", "NUMBER OF APPS RATED")
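The bar heights in a count plot are simply the number of rows per distinct rating, so the same information can be read as a table. The sketch below shows this with value_counts on a handful of made-up ratings; the values are illustrative only and are not the dataset's actual distribution.

```python
import pandas as pd

# Toy ratings, invented for illustration only.
ratings = pd.Series([0.0, 0.0, 4.5, 4.0, 4.5, 0.0, 3.5], name="user_rating")

# Number of apps per rating value, ordered by rating rather than by frequency.
counts = ratings.value_counts().sort_index()
print(counts)

# Share of apps sitting at a rating of 0.
zero_share = (ratings == 0).mean()
print(f"share rated 0: {zero_share:.2f}")
```

On the real data, the same two lines applied to store_data.user_rating would give the exact size of the spike at 0.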

 


 

 

User favourites

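Multiplying rating_count_tot by user_rating gives a score for all-time user favourites: an app ranks high only if it is both widely rated and well rated. In the same way, rating_count_ver * user_rating_ver scores the favourites for the current version.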

 

# Favourite score = number of ratings x average rating
store_data["favourites_tot"] = store_data["rating_count_tot"] * store_data["user_rating"]
store_data["favourites_ver"] = store_data["rating_count_ver"] * store_data["user_rating_ver"]

 

# Top 10 all-time favourites
favourite_app = store_data.sort_values("favourites_tot", ascending=False)
favourite_app_subset = favourite_app[:10]
visualizer(favourite_app_subset.track_name, favourite_app_subset.rating_count_tot, "bar", "FAVOURITES (ALL TIME)", "APP NAME", "RATING COUNT(TOTAL)", True, -60)

 

 

# Top 10 favourites for the current version
favourite_app_ver = store_data.sort_values("favourites_ver", ascending=False)
favourite_app_ver_subset = favourite_app_ver[:10]
visualizer(favourite_app_ver_subset.rating_count_ver, favourite_app_ver_subset.track_name, "bar", "FAVOURITES (CURRENT VERSION)", "RATING COUNT(CURRENT VERSION)", "APP NAME", False)
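To make the favourites score concrete, here is a small worked example on the five rows printed by store_data.head(). It assumes the flattened columns read in the order rating_count_tot, rating_count_ver, user_rating, user_rating_ver; that column order is an assumption about the output, so treat the numbers as illustrative.

```python
import pandas as pd

# The five apps from store_data.head(); column order is assumed (see above).
sample = pd.DataFrame({
    "track_name": [
        "PAC-MAN Premium",
        "Evernote - stay organized",
        "WeatherBug - Local Weather, Radar, Maps, Alerts",
        "eBay: Best App to Buy, Sell, Save! Online Shop...",
        "Bible",
    ],
    "rating_count_tot": [21292, 161065, 188583, 262241, 985920],
    "user_rating": [4.0, 4.0, 3.5, 4.0, 4.5],
})

# Favourite score = number of ratings x average rating.
sample["favourites_tot"] = sample["rating_count_tot"] * sample["user_rating"]

# Highest score first.
ranked = sample.sort_values("favourites_tot", ascending=False)
print(ranked[["track_name", "favourites_tot"]])
```

For example, PAC-MAN Premium scores 21292 * 4.0 = 85168, while Bible scores 985920 * 4.5 = 4436640 and leads this small sample.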

 

 

 

 

TRYING MACHINE LEARNING ALGORITHMS


 

 

import numpy as np
from sklearn.model_selection import cross_val_score

# Score an ML model with 10-fold cross-validated RMSE
# (derived from scikit-learn's negative mean squared error)
def model_scoring(model_name, model, X, y):
    # Cross-validation
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
    # Scores: flip the sign back, then take the square root
    rmse = np.sqrt(-scores)
    mean = rmse.mean()
    std = rmse.std()
    print(model_name)
    print()
    print("RMSE: {}".format(rmse))
    print("MEAN: {}".format(mean))
    print("STD: {}".format(std))
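scikit-learn scorers follow a "higher is better" convention, which is why the mean squared error comes back negated and has to be flipped before the square root. The sketch below shows that conversion end to end. X and y are not defined in this post, so it uses synthetic data from make_regression; the names X_demo and y_demo are placeholders for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic data, used only because X and y are not defined in this excerpt.
X_demo, y_demo = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# One negated MSE per fold (all values are <= 0).
neg_mse = cross_val_score(
    LinearRegression(), X_demo, y_demo,
    scoring="neg_mean_squared_error", cv=10,
)

# Flip the sign, then take the square root to get RMSE per fold.
rmse = np.sqrt(-neg_mse)
print("mean RMSE:", rmse.mean())
print("std RMSE:", rmse.std())
```

Two side notes: cross_val_score clones and refits the estimator on every fold, so fitting the model beforehand does not affect the scores, and recent scikit-learn versions also offer scoring="neg_root_mean_squared_error", which skips the manual square root.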

 


 

 

Linear Regression

# Model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg = lin_reg.fit(X, y)
# Scores
model_scoring("Linear Regression", lin_reg, X, y)

 

Linear Regression

RMSE: [ 49285.88154393 131839.57017058 53369.1195387 49738.45297409 122617.5801521 57107.76260608 53386.11011339 34798.8677453 55774.51046226 50504.61253404]
MEAN: 65842.24678404737
STD: 31304.711717799008

 

 

Polynomial Regression

# Model: degree-2 polynomial features fed into a linear regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly_features.fit_transform(X)
poly_reg = LinearRegression()
poly_reg = poly_reg.fit(x_poly, y)
# Scores
model_scoring("Polynomial Regression", poly_reg, x_poly, y)

 

Polynomial Regression

RMSE: [ 52167.67300318 127537.09974104 220638.21959749 79865.75375836 132654.09860815 239039.84114242 346957.11025312 35885.25245135 411828.9972216 48778.24192886]
MEAN: 169535.22877055555
STD: 124503.64878687791
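The polynomial model is an ordinary linear regression fitted on an expanded feature set. As a small sketch of what PolynomialFeatures(degree=2, include_bias=False) produces, the example below expands a single made-up row with two features, a = 2 and b = 3.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

poly = PolynomialFeatures(degree=2, include_bias=False)

# One made-up row with two features: a = 2, b = 3.
row = np.array([[2.0, 3.0]])
expanded = poly.fit_transform(row)

# Output columns are a, b, a^2, a*b, b^2.
print(expanded)                 # [[2. 3. 4. 6. 9.]]
print(poly.n_output_features_)  # 5
```

With n input features the degree-2 expansion yields n + n(n+1)/2 columns, so the feature count grows quickly. That extra flexibility may be part of why the fold scores above vary so widely, although this post does not investigate the cause.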


 

 

Support Vector Regression

# Model
from sklearn.svm import SVR
svr = SVR(kernel="linear")
y_ravel = y.ravel()  # SVR expects a 1-D target array
svr = svr.fit(X, y_ravel)
# Scores
model_scoring("Support Vector Regression", svr, X, y_ravel)

 

Support Vector Regression

RMSE: [ 51852.98471085 135462.89489448 56253.82780241 52637.14672826 126437.97460084 59117.82582051 56813.71707107 37658.4616797 58245.96025036 54045.38340168]
MEAN: 68852.61769601652
STD: 31635.728102472327


 

 

Decision Tree Regression

# Model
from sklearn.tree import DecisionTreeRegressor
dec_tree = DecisionTreeRegressor()
dec_tree = dec_tree.fit(X, y)
# Scores
model_scoring("Decision Tree Regression", dec_tree, X, y)

 

Decision Tree Regression

RMSE: [103872.78863749 128678.59984747 57741.61797849 70701.90745508 143381.52653562 67726.9762705 55034.46065352 100750.13357272 64338.28803757 55208.57068536]
MEAN: 84743.48696738183
STD: 30623.512101303342
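To compare the four models at a glance, the reported scores can be put side by side. The sketch below simply sorts the MEAN and STD values printed in this post by mean RMSE (lower is better); it does not rerun any model.

```python
# Mean and standard deviation of the 10-fold RMSE, copied from the outputs in this post.
results = {
    "Linear Regression": (65842.24678404737, 31304.711717799008),
    "Polynomial Regression": (169535.22877055555, 124503.64878687791),
    "Support Vector Regression": (68852.61769601652, 31635.728102472327),
    "Decision Tree Regression": (84743.48696738183, 30623.512101303342),
}

# Rank the models by mean RMSE, lowest (best) first.
ranking = sorted(results.items(), key=lambda item: item[1][0])

for name, (mean, std) in ranking:
    print(f"{name:<26} mean RMSE {mean:>10,.0f}   std {std:>10,.0f}")
```

On these numbers plain linear regression has the lowest mean RMSE, closely followed by the linear-kernel SVR, while polynomial regression is both the worst on average and by far the least stable across folds.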

 

 

 

https://www.kaggle.com/stephanjo/analysis-of-apple-s-app-store/edit

 
