Background
This grew out of a data analysis task in the course of my work.
Notes
For confidentiality, anything that looks like a proper noun or other sensitive detail has been masked with placeholders.
Module installation and detailed results are omitted.
The environment is Spyder on Anaconda.
For making small iterative fixes, Jupyter might be more convenient.
Then again, that may just mean I haven't mastered the debug mode.
Code
```python
#%% import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import classification_report, confusion_matrix
from io import StringIO
import pydotplus
from IPython import display
import statsmodels.api as sm

# Reference time for elapsed-day features (its definition was missing from the excerpt)
dt_now = pd.Timestamp.now()

#%% Load Data
df_rent = pd.read_csv(r"C:\xxx\rent.csv", encoding="utf-8", parse_dates=['xxx_date'])
df_rent['xxx_date'] = pd.to_datetime(df_rent['xxx_date'], errors='coerce')
df_tel = pd.read_csv(r"C:\xxx\tel.csv", encoding="utf-8", parse_dates=['xxx_date'])
# as SQL: LEFT OUTER JOIN on ID
df = pd.merge(df_rent, df_tel, how='left', on='ID')

#%% Set dummy target value
df.loc[df['types'] == 'zzz', 'flg'] = 1
df.loc[df['types'] == 'zzz', 'tgt_flg'] = 1
df.loc[df['status'] == 'cancel', 'tgt_flg'] = 0
df.loc[df['tgt_flg'] != 1, 'tgt_flg'] = 0
print('target sarf', sum(df['tgt_flg']))
# Exclude invalid SARF (the two backticked columns are distinct dates in the
# original; they share a mask here)
df = df.query("tgt_flg == 1 | (tbl_flg == 1 & (`xxx_date()`.notna() | `xxx_date`.notna()))",
              engine='python')
# Export for an Excel pivot table
df.to_excel(r"C:\xxx\merge.xlsx")

#%% Set dummy basic info
df['Type_Fix'] = df['Type'].str.replace(r'_.*', '', regex=True)  # drop the suffix after '_'
df_region = pd.get_dummies(df['region'], prefix='region')
region_list = list(df_region.columns)
df = pd.concat([df, df_region], axis=1)  # attach the dummies so df[x_list] works below

#%% Set dummy as large category
# yyy_list: masked list of category keywords, defined outside this excerpt
for yyy in yyy_list:
    df[yyy] = df['type'].str.contains(yyy)
    df.loc[df['type'].str.contains(yyy, na=False), 'category'] = yyy
df = df.fillna({'category': 'その他'})
# corporate vs. individual
df = df.fillna({'法人種別': '-'})
df.loc[df['法人種別'].str.contains('法人'), 'corp'] = '法人'
df.loc[df['法人種別'].str.contains('個人'), 'corp'] = '個人'
df = df.fillna({'corp': '分類不可'})

#%% LT (lead time in days)
date_list = ['xxx_date']
for date in date_list:
    date_alt = date + '_int'
    df[date_alt] = (dt_now - df[date]).dt.days
# The two date columns are distinct in the original but share a mask here
df['LT'] = df['xxx_date_int'] - df['xxx_date_int']
df = df.fillna({'LT': 0})
df['LT'] = df['LT'].astype('int')

#%% Model setting
x_list = ['LT']
x_list.extend(region_list)
y = df['tgt_flg'].astype('int')
x = df[x_list]

#%% corr
corr = pd.concat([y, x], axis=1).corr()
corr_series = corr['tgt_flg']  # each feature's correlation with the target
ttl = x.sum()

#%% decision tree analysis
def decisiontree(x, y, x_test, y_test):
    # Hyperparameters need tuning per dataset; class_weight compensates for
    # the heavy class imbalance
    tree_clf = DecisionTreeClassifier(max_depth=6,
                                      min_weight_fraction_leaf=0.1,
                                      class_weight={0: 1, 1: 100})
    tree_clf.fit(x, y)
    predicted = tree_clf.predict(x)
    print(len(y), ': ', int(sum(predicted == y) / len(y) * 100), '%')  # training accuracy
    y_pred = tree_clf.predict(x_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))

    # Render the tree via Graphviz
    g = StringIO()
    export_graphviz(tree_clf, out_file=g, feature_names=x.columns,
                    class_names=True, rounded=True, filled=True)
    graph = pydotplus.graph_from_dot_data(g.getvalue())
    graph.set_fontname('MS UI Gothic')
    for node in graph.get_nodes():  # font for nodes
        node.set_fontname('MS UI Gothic')
    for e in graph.get_edges():     # font for edges
        e.set_fontname('MS UI Gothic')
    graph.progs = {'dot': u"C:\\Program Files\\Graphviz\\bin\\dot.exe"}
    display.display(display.Image(graph.create_png()))

    # Feature importances
    feature = tree_clf.feature_importances_
    label = x.columns
    indices = np.argsort(feature)
    plt.figure(figsize=(10, 10))
    plt.barh(range(len(feature)), feature[indices])
    plt.yticks(range(len(feature)), label[indices], fontsize=14)
    plt.xticks(fontsize=14)
    plt.ylabel("Feature", fontsize=18)
    plt.xlabel("Feature Importance", fontsize=18)

print(x.columns)
X_train, X_test, y_train, y_test = train_test_split(x, y)  # , random_state=0)
decisiontree(X_train, y_train, X_test, y_test)  # fit on the training split only

#%% model
x = x.astype(int)
model = sm.OLS(y * 200, sm.add_constant(x))  # the ×200 scaling is from the original (context masked)
result = model.fit()
print(result.summary())

#%% Summary TBL
df_merge = df[df['tgt_flg'] == 1].copy()
fp = pd.crosstab(df['flg'], [df['region'], df['prefecture']], margins=True)
fp_ = pd.crosstab(df_merge['flg'], [df_merge['region'], df_merge['prefecture']], margins=True)
fp_ratio = fp_ / fp * 100

#%% outsourced to Excel
df_merge.loc[df_merge['tgt_flg'] == 0, 'invalid_column'] = np.nan
pvt = pd.pivot_table(df_merge, index='reason', columns='corp', values='LT', margins=True)
```
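Incidentally, the crosstab ratio and the pivot in the last two cells are computed but never written out in this excerpt. If they are also destined for Excel, a sketch like the following would export both; the file and sheet names are my own placeholders, not from the original.

```python
# Export the summary tables computed above; names are hypothetical
with pd.ExcelWriter(r"C:\xxx\summary.xlsx") as writer:
    fp_ratio.to_excel(writer, sheet_name='region_ratio')
    pvt.to_excel(writer, sheet_name='lt_pivot')
```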
Thoughts
Just as I was thinking all that, a job rotation put me on BigQuery.
Now that the environment is in place, I write everything in SQL.
In the first place, when the data is extremely large, no local machine, however powerful, can match a server, so it shouldn't be processed locally anyway.
Still, being able to chain modules together and polish charts and the like in one pass is convenient.
Though even that can be covered by aggregating with SQL in the cloud and exporting the result locally (see the sketch below).
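As a rough illustration of that workflow, a minimal sketch might look like the following. The google-cloud-bigquery client and its `to_dataframe()` call are real, but the table name, column names, and output path are hypothetical placeholders standing in for the masked originals.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses default application credentials

# Aggregate server-side; only the small result set crosses the network
query = """
    SELECT region, COUNT(*) AS n, AVG(LT) AS avg_lt
    FROM `project.dataset.rent`
    GROUP BY region
    ORDER BY n DESC
"""

df_agg = client.query(query).to_dataframe()  # pull the aggregate into pandas
df_agg.to_excel(r"C:\xxx\agg.xlsx")          # then hand off to Excel as before
```

The point is that the GROUP BY runs on the server, so the local side only ever touches the aggregated rows.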
Aside
Incidentally, the following were also helpful references for the work above.
Too often, people err and think it's OK to show exploratory analysis (simply present the data, all 100 oysters) when they should be showing explanatory (taking the time to turn the data into information that can be consumed by an audience: the two pearls).
— AOKI Takashige (@aochan_0119) December 20, 2022
Bars
Sometimes bar charts are avoided because they are common. This is a mistake. Rather, bar charts should be leveraged because they are common, as this means less of a learning curve for your audience. Instead of using their brain power to try to understand how to read the graph, your audience spends it figuring out what information to take away from the visual.
Bar charts are easy for our eyes to read. Our eyes compare the end points of the bars, so it is easy to see quickly which category is the biggest, which is the smallest, and also the incremental difference between categories. Note that, because of how our eyes compare the relative end points of the bars, it is important that bar charts always have a zero baseline (where the x-axis crosses the y-axis at zero), otherwise you get a false visual comparison.
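To make the zero-baseline point concrete, here is a small matplotlib sketch with made-up numbers: the same three bars look roughly comparable with a zero baseline, but wildly different once the axis is truncated.

```python
import matplotlib.pyplot as plt

categories = ['A', 'B', 'C']  # dummy categories
values = [48, 50, 53]         # dummy values

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
axes[0].bar(categories, values)   # baseline at zero (matplotlib's default)
axes[0].set_title('Zero baseline: honest comparison')
axes[1].bar(categories, values)
axes[1].set_ylim(45, 55)          # truncated axis exaggerates the gaps
axes[1].set_title('Truncated axis: misleading')
plt.tight_layout()
plt.show()
```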