AOKI's copy&paste archive

From a technical college via a regional national university and Tokyo Tech to a large company. A diary of wide-but-shallow hobbies and career notes.

SQL.py

Background

This came out of data analysis at work.

Notes

For confidentiality, anything that looks like it could touch on proper nouns or secrets has been masked by substitution.
Module installation and detailed results are omitted.
The environment is Spyder on Anaconda.
For iterating on small fixes, Jupyter might be more convenient.
Then again, maybe I just haven't mastered the debug mode.

Method

As the title says: SQL-style work, done in Python.
For a simple data join, MS Access would suffice, but I'm not used to it, and I pushed this through on the grounds that Python is a better fit for the follow-up analysis as well.
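
As a minimal illustration of the mapping (toy data, not the masked tables used below), a SQL LEFT OUTER JOIN corresponds directly to pandas.merge:

import pandas as pd

# toy stand-ins for the two CSVs loaded in the Code section
rent = pd.DataFrame({'ID': [1, 2, 3], 'rent': [80, 95, 70]})
tel = pd.DataFrame({'ID': [1, 3], 'tel': ['aaa', 'bbb']})

# SQL: SELECT * FROM rent LEFT OUTER JOIN tel ON rent.ID = tel.ID
joined = pd.merge(rent, tel, how = 'left', on = 'ID')
print(joined)  # unmatched rows get NaN, as NULL does in SQL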

Code

#%%import modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from io import StringIO
import pydotplus
from IPython import display
from sklearn.metrics import classification_report, confusion_matrix
import statsmodels.api as sm

#%%Load Data
df_rent = pd.read_csv(r"C:\xxx\rent.csv", 
                 encoding = "utf-8", 
                 parse_dates = ['xxx_date'])

df_rent['xxx_date'] = pd.to_datetime(df_rent['xxx_date'],
                                        errors = 'coerce')

df_tel = pd.read_csv(r"C:\xxx\tel.csv", 
                 encoding = "utf-8",
                 parse_dates = ['xxx_date'])

#as SQL LEFT OUTER JOIN
df = pd.merge(df_rent, df_tel, how = 'left', on = 'ID')

#%%Set Dummy target value
df.loc[df['types'] == 'zzz', 'flg'] = 1
df.loc[df['types'] == 'zzz', 'tgt_flg'] = 1
df.loc[df['status'] == 'cancel', 'tgt_flg'] = 0
df.loc[df['tgt_flg'] != 1, 'tgt_flg'] = 0
print('target sarf', sum(df['tgt_flg']))
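#order matters above: matching rows get tgt_flg = 1, 'cancel' rows are then
#zeroed, and anything still unset defaults to 0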

#Exclude invalid SARF
#both CSVs carry an 'xxx_date' column, so the merge suffixes them '_x'/'_y';
#the masked query is assumed to refer to both, and 'flg' (set above) is
#assumed where the original had the otherwise-undefined 'tbl_flg'
df = df.query("tgt_flg == 1 | (flg == 1 & (`xxx_date_x`.notna() | `xxx_date_y`.notna()))", engine='python')

#export a snapshot for an Excel pivot table
df0 = df.copy()  # copy so later edits to df don't touch the exported frame
df0.to_excel(r"C:\xxx\merge.xlsx")

#%%Set Dummy Basic Info
#str.replace with regex=True is required; plain Series.replace would not
#treat the pattern as a regex
df['Type_Fix'] = df['Type'].str.replace(r'_.*', '', regex=True)

df_region = pd.get_dummies(df['region'], prefix='region')
region_list = list(df_region.columns)
df = pd.concat([df, df_region], axis = 1)  # join the dummies back so df[x_list] works later


#%%Set Dummy as Large Category
#yyy_list: masked list of category keywords (definition omitted in the original)
for yyy in yyy_list:
    df[yyy] = df['type'].str.contains(yyy, na = False)
    df.loc[df['type'].str.contains(yyy, na = False), 'category'] = yyy

df = df.fillna({'category': 'その他'})  # 'その他' = Other


#corporate vs. individual
df = df.fillna({'法人種別': '-'})
df.loc[df['法人種別'].str.contains('法人'), 'corp'] = '法人'  # corporation
df.loc[df['法人種別'].str.contains('個人'), 'corp'] = '個人'  # individual
df = df.fillna({'corp': '分類不可'})  # unclassifiable


#%%LT
dt_now = pd.Timestamp.now()  # reference point; undefined in the original snippet
date_list = ['xxx_date_x', 'xxx_date_y']
for date in date_list:
    date_alt = date + '_int'
    df[date_alt] = (dt_now - df[date]).dt.days


#elapsed days between the two merged dates; the original subtracted the same
#column from itself, read here as a masking artifact
df['LT'] = df['xxx_date_x_int'] - df['xxx_date_y_int']
df = df.fillna({'LT': 0})
df['LT'] = df['LT'].astype('int')


#%%Model Setting
x_list = ['LT']
x_list.extend(region_list)

y = df['tgt_flg']
y = y.astype('int')
x = df[x_list]


#%%corr
corr = pd.concat([y, x], axis = 1).corr()
corr_series = corr['tgt_flg']  # each feature's correlation with the target
ttl = x.sum()  # column totals (counts for the dummy columns)

#%%decision tree analysis
def decisiontree(x, y):
    #adjustment required
    #class_weight up-weights the rare positive class 100:1
    tree_clf = DecisionTreeClassifier(max_depth = 6,
                                      min_weight_fraction_leaf = 0.1,
                                      class_weight = {0: 1, 1: 100})
    tree_clf.fit(x, y)
    predicted = tree_clf.predict(x)
    print(len(y), ': ', int(sum(predicted == y) / len(y) * 100), '%')  # in-sample accuracy
    #holdout evaluation; X_test / y_test are globals defined before the call
    y_pred = tree_clf.predict(X_test)
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    
    g = StringIO()
    export_graphviz(tree_clf,
                    out_file = g, 
                    feature_names = x.columns, 
                    class_names = True, 
                    rounded = True, 
                    filled = True)
    graph = pydotplus.graph_from_dot_data(g.getvalue())
    graph.set_fontname('MS UI Gothic')
    #Setting font for Node 
    for node in graph.get_nodes():
        node.set_fontname('MS UI Gothic')
    
    #Setting font for Edges 
    for e in graph.get_edges():
        e.set_fontname('MS UI Gothic')
        
    graph.progs = {'dot': u"C:\\Program Files\\Graphviz\\bin\\dot.exe"}
    display.display(display.Image(graph.create_png()))
    
    feature = tree_clf.feature_importances_
    label = x.columns  # use the frame passed in, not the global X_train
    indices = np.argsort(feature)

    fig = plt.figure(figsize = (10, 10))
    plt.barh(range(len(feature)), feature[indices])
    plt.yticks(range(len(feature)), label[indices], fontsize = 14)
    plt.xticks(fontsize = 14)
    plt.ylabel("Feature", fontsize = 18)
    plt.xlabel("Feature Importance", fontsize = 18)
    plt.show()
    
    

print(x.columns)
X_train, X_test, y_train, y_test = train_test_split(x, y)#, random_state=0)

#fit on the training split so the confusion matrix reflects unseen data
decisiontree(X_train, y_train)


#%%model
x = x.astype(int)
model = sm.OLS(y * 200, sm.add_constant(x))  # target scaled by 200, so coefficients read on a 0-200 scale
result = model.fit()
print(result.summary())
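
#a 0/1 target is more conventionally fit with logistic regression; a minimal
#sketch using the same design matrix (an added illustration, not part of the
#original analysis)
from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression(max_iter = 1000, class_weight = 'balanced')
logreg.fit(x, y)
print(pd.Series(logreg.coef_[0], index = x.columns).sort_values())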


#%%Summary TBL
df_merge = df[df['tgt_flg'] == 1].copy()  # .copy() avoids SettingWithCopyWarning below

fp = pd.crosstab(df['flg'], [df['region'], df['prefecture']],
                  margins=True)

fp_ = pd.crosstab(df_merge['flg'], [df_merge['region'], df_merge['prefecture']],
                  margins=True)

fp_ratio = fp_ / fp * 100
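#fp_ratio: share (%) of target-flagged rows among all rows, per region x prefecture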

#%%out-sourced to excel
df_merge.loc[df_merge['tgt_flg'] == 0, 'invalid_column'] = np.nan

pvt = pd.pivot_table(df_merge, index = 'reason', columns = 'corp',
                     values = 'LT', margins=True)
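
#pivot_table defaults to aggfunc='mean': average LT per reason x corp, with
#margins=True adding 'All' rows/columns of overall means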

Thoughts

No sooner had I written this than a job rotation put me onto BigQuery.
With the environment in place, I now write everything in straight SQL.
When the data is huge to begin with, no local machine, however powerful, can match the servers, so it shouldn't be processed locally anyway.

Still, it is convenient to chain modules together and polish charts and the like in one place.
Though even that can be had by aggregating in cloud SQL and exporting the result locally.

Aside

Incidentally, the following was also a useful reference for the work above.


Bars

Sometimes bar charts are avoided because they are common. This is a mistake. Rather, bar charts should be leveraged because they are common, as this means less of a learning curve for your audience. Instead of using their brain power to try to understand how to read the graph, your audience spends it figuring out what information to take away from the visual.

Bar charts are easy for our eyes to read. Our eyes compare the end points of the bars, so it is easy to see quickly which category is the biggest, which is the smallest, and also the incremental difference between categories. Note that, because of how our eyes compare the relative end points of the bars, it is important that bar charts always have a zero baseline (where the x-axis crosses the y-axis at zero), otherwise you get a false visual comparison.
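
To see why the zero baseline matters, here is a minimal matplotlib sketch (an added illustration, not from the quoted text): the same values drawn twice, once with a zero baseline and once with a truncated axis that exaggerates the gaps.

import matplotlib.pyplot as plt

categories = ['A', 'B', 'C']
values = [92, 95, 99]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize = (8, 3))

ax1.bar(categories, values)
ax1.set_ylim(0, 100)  # honest comparison: baseline at zero
ax1.set_title('Zero baseline')

ax2.bar(categories, values)
ax2.set_ylim(90, 100)  # truncated: small differences look huge
ax2.set_title('Truncated baseline')

plt.show()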