store_data.head()
1 | 281656475 | PAC-MAN Premium | 100788224 | USD | 3.99 | 21292 | 26 | 4.0 | 4.5 | 6.3.5 | 4+ | Games | 38 | 5 | 10 | 1 |
2 | 281796108 | Evernote - stay organized | 158578688 | USD | 0.00 | 161065 | 26 | 4.0 | 3.5 | 8.2.2 | 4+ | Productivity | 37 | 5 | 23 | 1 |
3 | 281940292 | WeatherBug - Local Weather, Radar, Maps, Alerts | 100524032 | USD | 0.00 | 188583 | 2822 | 3.5 | 4.5 | 5.0.0 | 4+ | Weather | 37 | 5 | 3 | 1 |
4 | 282614216 | eBay: Best App to Buy, Sell, Save! Online Shop... | 128512000 | USD | 0.00 | 262241 | 649 | 4.0 | 4.5 | 5.10.0 | 12+ | Shopping | 37 | 5 | 9 | 1 |
5 | 282935706 | Bible | 92774400 | USD | 0.00 | 985920 | 5320 | 4.5 | 5.0 | 7.5.1 | 4+ | Reference | 37 | 5 | 45 | 1 |
Top 10 apps on the basis of total rating
The total rating count is a rough indicator of the number of downloads, so we treat rating_count_tot as the target variable in place of download numbers: more ratings suggest more users.
store_data_sorted = store_data.sort_values('rating_count_tot', ascending=False)
subset_store_data_sorted = store_data_sorted[:10]
visualizer(subset_store_data_sorted.track_name, subset_store_data_sorted.rating_count_tot, "bar", "TOP 10 APPS ON THE BASIS OF TOTAL RATINGS", "APP NAME", "RATING COUNT (TOTAL)", True, -60)
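The sort-and-slice step above can also be expressed with pandas' nlargest. The following is a minimal sketch on hypothetical toy data; only the column names are taken from the notebook, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data; the column names mirror the notebook's dataset.
toy = pd.DataFrame({
    "track_name": ["A", "B", "C", "D"],
    "rating_count_tot": [120, 980, 45, 300],
})

# Keep the 2 rows with the largest rating_count_tot, in descending order.
# For this data it selects the same rows as sort_values(..., ascending=False)[:2].
top2 = toy.nlargest(2, "rating_count_tot")
print(top2.track_name.tolist())
```

For a top-N selection this avoids holding a fully sorted copy of the frame, although on a dataset of this size the difference is unlikely to matter.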
Top 10 apps on the basis of price
Every app in this dataset is priced in USD.
store_data.currency.unique()
array(['USD'], dtype=object)
store_data_price = store_data.sort_values('price', ascending=False)
subset_store_data_price = store_data_price[:10]
visualizer(subset_store_data_price.price, subset_store_data_price.track_name, "bar", "TOP 10 APPS ON THE BASIS OF PRICE", "Price (in USD)", "APP NAME")
Linear Correlation of Features
lang.num (the number of languages an app supports) shows the highest correlation with rating_count_tot (the total rating count).
corr_store_data = store_data.corr()
corr_store_data["rating_count_tot"].sort_values(ascending=False)
plt.figure(figsize=(15,15))
plt.title("CORRELATION OF FEATURES", fontsize=20)
sns.heatmap(corr_store_data)
plt.xticks(rotation=(-60), fontsize=15)
plt.yticks(fontsize=15)
plt.show()
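For reference, DataFrame.corr() computes the Pearson correlation coefficient by default. The sketch below spells that formula out in plain Python; the function name pearson and the sample values are hypothetical and are not taken from the dataset.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (proportional to the covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (proportional to the variances).
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values standing in for lang.num and rating_count_tot.
languages = [1, 5, 10, 20, 40]
ratings = [10, 60, 90, 210, 380]
r = pearson(languages, ratings)
print(round(r, 4))
```

A value near 1 or -1 indicates a strong linear relationship. The coefficient only captures linear association, so a low value does not rule out a non-linear dependence.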
User Ratings on the App Store
Here is the count plot of user ratings on the Apple App Store. Many apps are rated 0.
visualizer(store_data.user_rating, None, "count", "RATINGS ON APP STORE", "RATINGS", "NUMBER OF APPS RATED")
User favourites
The product rating_count_tot * user_rating gives a rough measure of all-time user favourites, and rating_count_ver * user_rating_ver does the same for the current version.
store_data["favourites_tot"] = store_data["rating_count_tot"] * store_data["user_rating"]
store_data["favourites_ver"] = store_data["rating_count_ver"] * store_data["user_rating_ver"]
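To see how this score behaves, here is a small sketch on hypothetical apps. The column names follow the notebook, while the app names and numbers are invented for illustration.

```python
import pandas as pd

# Hypothetical apps: one widely rated, one highly rated but niche, one unrated.
toy = pd.DataFrame({
    "track_name": ["Popular", "Niche", "Unrated"],
    "rating_count_tot": [1000, 50, 0],
    "user_rating": [3.5, 5.0, 0.0],
})

# Same score as in the notebook: rating count multiplied by average rating.
toy["favourites_tot"] = toy["rating_count_tot"] * toy["user_rating"]
ranked = toy.sort_values("favourites_tot", ascending=False)
print(ranked[["track_name", "favourites_tot"]])
```

Because the rating count spans a much wider range than the 0 to 5 rating, the score is dominated by popularity: the widely rated app outranks the niche app despite its lower average rating.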
favourite_app = store_data.sort_values("favourites_tot", ascending=False)
favourite_app_subset = favourite_app[:10]
visualizer(favourite_app_subset.track_name, favourite_app_subset.rating_count_tot, "bar", "FAVOURITES (ALL TIME)", "APP NAME", "RATING COUNT(TOTAL)", True, -60)
favourite_app_ver = store_data.sort_values("favourites_ver", ascending=False)
favourite_app_ver_subset = favourite_app_ver[:10]
visualizer(favourite_app_ver_subset.rating_count_ver, favourite_app_ver_subset.track_name, "bar", "FAVOURITES (CURRENT VERSION)", "RATING COUNT(CURRENT VERSION)", "APP NAME", False)
TRYING MACHINE LEARNING ALGORITHMS
import numpy as np
from sklearn.model_selection import cross_val_score

# Score an ML model with 10-fold cross-validation (RMSE derived from the negative MSE)
def model_scoring(model_name, model, X, y):
    # Cross-validation: each score is the negated mean squared error of one fold
    scores = cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10)
    # Convert to RMSE per fold and summarize
    rmse = np.sqrt(-scores)
    mean = rmse.mean()
    std = rmse.std()
    print(model_name)
    print()
    print("RMSE: {}".format(rmse))
    print("MEAN: {}".format(mean))
    print("STD: {}".format(std))
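The sign flip in the scoring helper is easy to miss: with scoring="neg_mean_squared_error", scikit-learn reports the negated MSE so that larger scores are always better. A minimal sketch of the conversion, using hypothetical fold scores rather than values from this notebook:

```python
import math
import statistics

# Hypothetical per-fold scores in the form returned with
# scoring="neg_mean_squared_error" (negated MSE, one value per fold).
neg_mse_scores = [-4.0, -9.0, -16.0]

# Negate to recover the MSE, then take the square root to get RMSE per fold.
rmse = [math.sqrt(-s) for s in neg_mse_scores]

mean_rmse = statistics.mean(rmse)
# Population standard deviation, matching the default of NumPy's .std().
std_rmse = statistics.pstdev(rmse)
print(rmse, mean_rmse, std_rmse)
```

RMSE is in the same units as the target, here a number of ratings, which makes the fold scores easier to interpret than raw MSE.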
Linear Regression
# Model
from sklearn.linear_model import LinearRegression
lin_reg = LinearRegression()
lin_reg = lin_reg.fit(X, y)
# Scores
model_scoring("Linear Regression", lin_reg, X, y)
Linear Regression
RMSE: [ 49285.88154393 131839.57017058 53369.1195387 49738.45297409 122617.5801521 57107.76260608 53386.11011339 34798.8677453 55774.51046226 50504.61253404]
MEAN: 65842.24678404737
STD: 31304.711717799008
Polynomial Regression
from sklearn.preprocessing import PolynomialFeatures
poly_features = PolynomialFeatures(degree=2, include_bias=False)
x_poly = poly_features.fit_transform(X)
poly_reg = LinearRegression()
poly_reg = poly_reg.fit(x_poly, y)
# Scores
model_scoring("Polynomial Regression", poly_reg, x_poly, y)
Polynomial Regression
RMSE: [ 52167.67300318 127537.09974104 220638.21959749 79865.75375836 132654.09860815 239039.84114242 346957.11025312 35885.25245135 411828.9972216 48778.24192886]
MEAN: 169535.22877055555
STD: 124503.64878687791
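To make the degree-2 expansion concrete, the sketch below builds the same kind of terms by hand for a single row: the original features followed by all squares and pairwise products, with no constant column. The helper degree2_terms is hypothetical and written only for illustration; it is not part of scikit-learn.

```python
from itertools import combinations_with_replacement

def degree2_terms(row):
    """Degree-1 and degree-2 terms of one sample, without a bias column."""
    linear = list(row)
    # Every pair (a, b) with repetition allowed: squares and cross products.
    quadratic = [a * b for a, b in combinations_with_replacement(row, 2)]
    return linear + quadratic

# Two input features a=2, b=3 become: a, b, a*a, a*b, b*b.
expanded = degree2_terms([2.0, 3.0])
print(expanded)

# The feature count grows quickly: 10 inputs yield 10 + 55 = 65 columns.
print(len(degree2_terms([1.0] * 10)))
```

The rapid growth in the number of columns is one possible reason a polynomial model can vary much more from fold to fold than the plain linear model, as the larger standard deviation above suggests.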
Support Vector Regression
from sklearn.svm import SVR
svr = SVR(kernel="linear")
y_ravel = y.ravel()
svr = svr.fit(X, y_ravel)
# Scores
model_scoring("Support Vector Regression", svr, X, y_ravel)
Support Vector Regression
RMSE: [ 51852.98471085 135462.89489448 56253.82780241 52637.14672826 126437.97460084 59117.82582051 56813.71707107 37658.4616797 58245.96025036 54045.38340168]
MEAN: 68852.61769601652
STD: 31635.728102472327
Decision Tree Regression
# Model
from sklearn.tree import DecisionTreeRegressor
dec_tree = DecisionTreeRegressor()
dec_tree = dec_tree.fit(X, y)
# Scores
model_scoring("Decision Tree Regression", dec_tree, X, y)
Decision Tree Regression
RMSE: [103872.78863749 128678.59984747 57741.61797849 70701.90745508 143381.52653562 67726.9762705 55034.46065352 100750.13357272 64338.28803757 55208.57068536]
MEAN: 84743.48696738183
STD: 30623.512101303342
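The four cross-validation summaries can be compared side by side. The sketch below copies the MEAN and STD values printed above, rounded to two decimals, and ranks the models by mean RMSE; the variable names are illustrative.

```python
# (mean RMSE, std of RMSE) per model, rounded from the outputs above.
results = {
    "Linear Regression": (65842.25, 31304.71),
    "Polynomial Regression": (169535.23, 124503.65),
    "Support Vector Regression": (68852.62, 31635.73),
    "Decision Tree Regression": (84743.49, 30623.51),
}

# Lower mean RMSE is better, so sort in ascending order.
ranking = sorted(results, key=lambda name: results[name][0])
for name in ranking:
    mean, std = results[name]
    print("{:<26} mean={:>10.2f} std={:>10.2f}".format(name, mean, std))
```

On these numbers plain linear regression has the lowest mean RMSE, with linear-kernel SVR close behind. The fold-to-fold standard deviations are large relative to the gaps between the top models, so the ordering among them should be read with caution.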