python与SQL比较与转化实例操作

python与SQL比较与转化
由于许多潜在的pandas用户对SQL有一定的了解 ,因此本页面旨在提供一些如何使用pandas执行各种SQL操作的示例。

按照惯例,我们按如下方式导入pandas和NumPy:

import numpy as np
import pandas as pd
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips = pd.read_csv(url)
tips.head()

Out[11]:

total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4

sql(SELECT)和python进行转化:

 sql:

SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;

python :

tips[['total_bill', 'tip', 'smoker', 'sex']].head(5)

输出结果: 

total_billtipsmokersex
016.991.01NoFemale
110.341.66NoMale
221.013.50NoMale
323.683.31NoMale
424.593.61NoFemale

sql(WHERE)和python进行转化:

sql:

SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;

python:

tips[tips['time']=='Dinner'].head(5)

 结果:

Out[71]:

total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4

另一种实现方法:

is_dinner = tips['time'] == 'Dinner'
is_dinner.value_counts()
tips[is_dinner].head(5)

 结果:

total_billtipsexsmokerdaytimesize
016.991.01FemaleNoSunDinner2
110.341.66MaleNoSunDinner3
221.013.50MaleNoSunDinner3
323.683.31MaleNoSunDinner2
424.593.61FemaleNoSunDinner4

sql:

SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;

 python:

tips[(tips['time']=='Dinner')&(tips['tip']>5)]
total_billtipsexsmokerdaytimesize
2339.427.58MaleNoSatDinner4
4430.405.60MaleNoSunDinner4
4732.406.00MaleNoSunDinner4
5234.815.20FemaleNoSunDinner4
5948.276.73MaleNoSatDinner4
11629.935.07MaleNoSunDinner4
15529.855.14FemaleNoSunDinner5
17050.8110.00MaleYesSatDinner3
1727.255.15MaleYesSunDinner2
18123.335.65MaleYesSunDinner2
18323.176.50MaleYesSunDinner4
21125.895.16MaleYesSatDinner4
21248.339.00MaleNoSatDinner4
21428.176.50FemaleYesSatDinner3
23929.035.92MaleNoSatDinner3

数据源:

frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
   ....:                       'col2': ['F', np.NaN, 'G', 'H', 'I']})
col1col2
0AF
1BNaN
2NaNG
3CH
4DI

 sql:

SELECT *
FROM frame
WHERE col2 IS NULL;

python:

frame[frame['col2'].isna()]

col1col2
1BNaN

 sql:

SELECT *
FROM frame
WHERE col1 IS NOT NULL;

python:

frame[frame['col1'].notna()]

col1col2
0AF
1BNaN
3CH
4DI

sql(GROUP BY)和python的转化:

sql:

SELECT sex, count(*)
FROM tips
GROUP BY sex;

python:tips.groupby('sex').size()而不是count()

tips.groupby('sex').size()

Out[92]:

sex
Female     87
Male      157
dtype: int64
tips.groupby('sex').count()
total_billtipsmokerdaytimesize
sex
Female878787878787
Male157157157157157157

count()也可以实现:

tips.groupby('sex')['total_bill'].count() 

Out[93]:

sex
Female     87
Male      157
dtype: int64

sql:

SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri   2.734737   19
Sat   2.993103   87
Sun   3.255132   76
Thur  2.771452   62
*/

python:

tips.groupby('day').agg({'tip':np.mean,'day':np.size})
tipday
day
Fri2.73473719
Sat2.99310387
Sun3.25513276
Thur2.77145262

 sql:

SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No     Fri      4  2.812500
       Sat     45  3.102889
       Sun     57  3.167895
       Thur    45  2.673778
Yes    Fri     15  2.714000
       Sat     42  2.875476
       Sun     19  3.516842
       Thur    17  3.030000
*/

python: 

tips.groupby(['smoker', 'day']).agg({ 'day': np.size,'tip': np.mean})
tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
#与上面等价
tips.groupby(['smoker', 'day'])['tip'].agg([('avg_balance','size'), ('stdev_balance','mean')])

tipday
smokerday
NoFri2.8125004
Sat3.10288945
Sun3.16789557
Thur2.67377845
YesFri2.71400015
Sat2.87547642
Sun3.51684219
Thur3.03000017

tip
sizemean
smokerday
NoFri4.02.812500
Sat45.03.102889
Sun57.03.167895
Thur45.02.673778
YesFri15.02.714000
Sat42.02.875476
Sun19.03.516842
Thur17.03.030000

JOIN 数据源

In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
   ....:                     'value': np.random.randn(4)})
   ....: 

In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
   ....:                     'value': np.random.randn(4)})

sql: 

--INNER JOIN
SELECT *
FROM df1
INNER JOIN df2
  ON df1.key = df2.key;

 python:

pd.merge(df1,df2,on='key')
keyvalue_xvalue_y
0B0.9357291.373630
1D-0.714890-0.184677
2D-0.7148902.720273

 第二种实现方式:

indexed_df2 = df2.set_index('key')
pd.merge(df1, indexed_df2, left_on='key', right_index=True)

Out[105]:

keyvalue_xvalue_y
1B0.9357291.373630
3D-0.714890-0.184677
3D-0.7148902.720273

 sql:

--LEFT  JOIN
-- show all records from df1
SELECT *
FROM df1
LEFT  JOIN df2
  ON df1.key = df2.key;
pd.merge(df1,df2,on='key',how='left')
keyvalue_xvalue_y
0A0.953422NaN
1B0.9357291.373630
2C-0.790669NaN
3D-0.714890-0.184677
4D-0.7148902.720273
FULL JOIN
-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
  ON df1.key = df2.key;
pd.merge(df1, df2, on='key', how='outer')
keyvalue_xvalue_y
0A0.953422NaN
1B0.9357291.373630
2C-0.790669NaN
3D-0.714890-0.184677
4D-0.7148902.720273
5ENaN-0.180246
UNION
UNION ALL can be performed using concat().

In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
   ....:                     'rank': range(1, 4)})
   ....: 

In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
   ....:                     'rank': [1, 4, 5]})
   ....: 

SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
         city  rank
      Chicago     1
San Francisco     2
New York City     3
      Chicago     1
       Boston     4
  Los Angeles     5
*/
pd.concat([df1,df2])
cityrank
0Chicago1
1San Francisco2
2New York City3
0Chicago1
1Boston4
2Los Angeles5

SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
pd.concat([df1,df2]).drop_duplicates()
cityrank
0Chicago1
1San Francisco2
2New York City3
1Boston4
2Los Angeles5
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;

tips.nlargest(10+5,columns='tip').tail(10)
total_billtipsexsmokerdaytimesize
18323.176.50MaleYesSunDinner4
21428.176.50FemaleYesSatDinner3
4732.406.00MaleNoSunDinner4
23929.035.92MaleNoSatDinner3
8824.715.85MaleNoThurLunch2
18123.335.65MaleYesSunDinner2
4430.405.60MaleNoSunDinner4
5234.815.20FemaleNoSunDinner4
8534.835.17FemaleNoThurLunch4
21125.895.16MaleYesSatDinner4
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
  SELECT
    t.*,
    ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
  FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;

 方法一

tips.assign(rn=tips.sort_values(['total_bill'],ascending=False).groupby(['day']).cumcount()+1).query('rn<3').sort_values(['day','rn'])
total_billtipsexsmokerdaytimesizern
9540.174.73MaleYesFriDinner41
9028.973.00MaleYesFriDinner22
17050.8110.00MaleYesSatDinner31
21248.339.00MaleNoSatDinner42
15648.175.00MaleNoSunDinner61
18245.353.50MaleYesSunDinner32
19743.115.00FemaleYesThurLunch41
14241.195.00MaleNoThurLunch52
tips.assign(rnk=tips.groupby(['day'])['total_bill'].rank(method='first', ascending=False)).query('rnk < 3').sort_values(['day','rnk'])

total_billtipsexsmokerdaytimesizernk
9540.174.73MaleYesFriDinner41.0
9028.973.00MaleYesFriDinner22.0
17050.8110.00MaleYesSatDinner31.0
21248.339.00MaleNoSatDinner42.0
15648.175.00MaleNoSunDinner61.0
18245.353.50MaleYesSunDinner32.0
19743.115.00FemaleYesThurLunch41.0
14241.195.00MaleNoThurLunch52.0
-- Oracle's RANK() analytic function
SELECT * FROM (
  SELECT
    t.*,
    RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
  FROM tips t
  WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;

tips[tips['tip'] < 2].assign(rnk_min=tips.groupby(['sex'])['tip'].rank(method='min')).query('rnk_min < 3').sort_values(['sex','rnk_min'])
total_billtipsexsmokerdaytimesizernk_min
673.071.00FemaleYesSatDinner11.0
925.751.00FemaleYesFriDinner21.0
1117.251.00FemaleNoSatDinner11.0
23612.601.00MaleYesSatDinner21.0
23732.831.17MaleYesSatDinner22.0
UPDATE
UPDATE tips
SET tip = tip*2
WHERE tip < 2;

tips.loc[tips['tip'] < 2, 'tip'] *= 2

--DELETE
DELETE FROM tips
WHERE tip > 9;
 tips = tips.loc[tips['tip'] <= 9]

开窗函数:

array =  [
    ['a', 2, 29],
    ['a', 1, 18],
    ['a', 0, 18],
    ['b', 2, 25],
    ['b', 4, 12],
    ['c', 6, 21],
]
data = pd.DataFrame(array, columns=["name", "class", "score"])
nameclassscore
0a229
1a118
2a018
3b225
4b412
5c621

# 去重计数:nunique
data['nunique'] = data.groupby('name')['name'].transform('nunique')

# # 求和:sum

data['sum'] = data.groupby('name')['score'].transform('sum')
# # 均值:mean
data['mean'] = data.groupby('name')['score'].transform('mean')

# row_number
data['row_number'] = data.groupby('name')['class'].rank(ascending=True,method='first')  

结果:

nameclassscorenuniquesummeanrow_number
0a22916521.6666673.0
1a11816521.6666672.0
2a01816521.6666671.0
3b22513718.5000001.0
4b41213718.5000002.0
5c62112121.0000001.0

版权声明:本文为baidu_20183817原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接和本声明。