python与SQL比较与转化
由于许多潜在的pandas用户对SQL有一定的了解 ,因此本页面旨在提供一些如何使用pandas执行各种SQL操作的示例。
按照惯例,我们按如下方式导入pandas和NumPy:
import numpy as np
import pandas as pd
url = 'https://raw.github.com/pandas-dev/pandas/master/pandas/tests/data/tips.csv'
tips = pd.read_csv(url)
tips.head()Out[11]:
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
sql(SELECT)和python进行转化:
sql:
SELECT total_bill, tip, smoker, time
FROM tips
LIMIT 5;python :
tips[['total_bill', 'tip', 'smoker', 'sex']].head(5)输出结果:
| total_bill | tip | smoker | sex | |
|---|---|---|---|---|
| 0 | 16.99 | 1.01 | No | Female |
| 1 | 10.34 | 1.66 | No | Male |
| 2 | 21.01 | 3.50 | No | Male |
| 3 | 23.68 | 3.31 | No | Male |
| 4 | 24.59 | 3.61 | No | Female |
sql(WHERE)和python进行转化:
sql:
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;python:
tips[tips['time']=='Dinner'].head(5)结果:
Out[71]:
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
另一种实现方法:
is_dinner = tips['time'] == 'Dinner'
is_dinner.value_counts()
tips[is_dinner].head(5)结果:
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.50 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
sql:
SELECT *
FROM tips
WHERE time = 'Dinner' AND tip > 5.00;python:
tips[(tips['time']=='Dinner')&(tips['tip']>5)]| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 23 | 39.42 | 7.58 | Male | No | Sat | Dinner | 4 |
| 44 | 30.40 | 5.60 | Male | No | Sun | Dinner | 4 |
| 47 | 32.40 | 6.00 | Male | No | Sun | Dinner | 4 |
| 52 | 34.81 | 5.20 | Female | No | Sun | Dinner | 4 |
| 59 | 48.27 | 6.73 | Male | No | Sat | Dinner | 4 |
| 116 | 29.93 | 5.07 | Male | No | Sun | Dinner | 4 |
| 155 | 29.85 | 5.14 | Female | No | Sun | Dinner | 5 |
| 170 | 50.81 | 10.00 | Male | Yes | Sat | Dinner | 3 |
| 172 | 7.25 | 5.15 | Male | Yes | Sun | Dinner | 2 |
| 181 | 23.33 | 5.65 | Male | Yes | Sun | Dinner | 2 |
| 183 | 23.17 | 6.50 | Male | Yes | Sun | Dinner | 4 |
| 211 | 25.89 | 5.16 | Male | Yes | Sat | Dinner | 4 |
| 212 | 48.33 | 9.00 | Male | No | Sat | Dinner | 4 |
| 214 | 28.17 | 6.50 | Female | Yes | Sat | Dinner | 3 |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
数据源:
frame = pd.DataFrame({'col1': ['A', 'B', np.NaN, 'C', 'D'],
....: 'col2': ['F', np.NaN, 'G', 'H', 'I']})| col1 | col2 | |
|---|---|---|
| 0 | A | F |
| 1 | B | NaN |
| 2 | NaN | G |
| 3 | C | H |
| 4 | D | I |
sql:
SELECT *
FROM frame
WHERE col2 IS NULL;python:
frame[frame['col2'].isna()]| col1 | col2 | |
|---|---|---|
| 1 | B | NaN |
sql:
SELECT *
FROM frame
WHERE col1 IS NOT NULL;python:
frame[frame['col1'].notna()]| col1 | col2 | |
|---|---|---|
| 0 | A | F |
| 1 | B | NaN |
| 3 | C | H |
| 4 | D | I |
sql(GROUP BY)和python的转化:
sql:
SELECT sex, count(*)
FROM tips
GROUP BY sex;python:tips.groupby('sex').size()而不是count()
tips.groupby('sex').size()Out[92]:
sex
Female 87
Male 157
dtype: int64tips.groupby('sex').count()| total_bill | tip | smoker | day | time | size | |
|---|---|---|---|---|---|---|
| sex | ||||||
| Female | 87 | 87 | 87 | 87 | 87 | 87 |
| Male | 157 | 157 | 157 | 157 | 157 | 157 |
count()也可以实现:
tips.groupby('sex')['total_bill'].count() Out[93]:
sex
Female 87
Male 157
dtype: int64sql:
SELECT day, AVG(tip), COUNT(*)
FROM tips
GROUP BY day;
/*
Fri 2.734737 19
Sat 2.993103 87
Sun 3.255132 76
Thur 2.771452 62
*/python:
tips.groupby('day').agg({'tip':np.mean,'day':np.size})| tip | day | |
|---|---|---|
| day | ||
| Fri | 2.734737 | 19 |
| Sat | 2.993103 | 87 |
| Sun | 3.255132 | 76 |
| Thur | 2.771452 | 62 |
sql:
SELECT smoker, day, COUNT(*), AVG(tip)
FROM tips
GROUP BY smoker, day;
/*
smoker day
No Fri 4 2.812500
Sat 45 3.102889
Sun 57 3.167895
Thur 45 2.673778
Yes Fri 15 2.714000
Sat 42 2.875476
Sun 19 3.516842
Thur 17 3.030000
*/python:
tips.groupby(['smoker', 'day']).agg({ 'day': np.size,'tip': np.mean})
tips.groupby(['smoker', 'day']).agg({'tip': [np.size, np.mean]})
#与上面等价
tips.groupby(['smoker', 'day'])['tip'].agg([('avg_balance','size'), ('stdev_balance','mean')])| tip | day | ||
|---|---|---|---|
| smoker | day | ||
| No | Fri | 2.812500 | 4 |
| Sat | 3.102889 | 45 | |
| Sun | 3.167895 | 57 | |
| Thur | 2.673778 | 45 | |
| Yes | Fri | 2.714000 | 15 |
| Sat | 2.875476 | 42 | |
| Sun | 3.516842 | 19 | |
| Thur | 3.030000 | 17 |
| tip | |||
|---|---|---|---|
| size | mean | ||
| smoker | day | ||
| No | Fri | 4.0 | 2.812500 |
| Sat | 45.0 | 3.102889 | |
| Sun | 57.0 | 3.167895 | |
| Thur | 45.0 | 2.673778 | |
| Yes | Fri | 15.0 | 2.714000 |
| Sat | 42.0 | 2.875476 | |
| Sun | 19.0 | 3.516842 | |
| Thur | 17.0 | 3.030000 | |
JOIN 数据源
In [22]: df1 = pd.DataFrame({'key': ['A', 'B', 'C', 'D'],
....: 'value': np.random.randn(4)})
....:
In [23]: df2 = pd.DataFrame({'key': ['B', 'D', 'D', 'E'],
....: 'value': np.random.randn(4)})sql:
--INNER JOIN
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;python:
pd.merge(df1,df2,on='key')| key | value_x | value_y | |
|---|---|---|---|
| 0 | B | 0.935729 | 1.373630 |
| 1 | D | -0.714890 | -0.184677 |
| 2 | D | -0.714890 | 2.720273 |
第二种实现方式:
indexed_df2 = df2.set_index('key')
pd.merge(df1, indexed_df2, left_on='key', right_index=True)Out[105]:
| key | value_x | value_y | |
|---|---|---|---|
| 1 | B | 0.935729 | 1.373630 |
| 3 | D | -0.714890 | -0.184677 |
| 3 | D | -0.714890 | 2.720273 |
sql:
--LEFT JOIN
-- show all records from df1
SELECT *
FROM df1
LEFT JOIN df2
ON df1.key = df2.key;pd.merge(df1,df2,on='key',how='left')| key | value_x | value_y | |
|---|---|---|---|
| 0 | A | 0.953422 | NaN |
| 1 | B | 0.935729 | 1.373630 |
| 2 | C | -0.790669 | NaN |
| 3 | D | -0.714890 | -0.184677 |
| 4 | D | -0.714890 | 2.720273 |
FULL JOIN
-- show all records from both tables
SELECT *
FROM df1
FULL OUTER JOIN df2
ON df1.key = df2.key;pd.merge(df1, df2, on='key', how='outer')| key | value_x | value_y | |
|---|---|---|---|
| 0 | A | 0.953422 | NaN |
| 1 | B | 0.935729 | 1.373630 |
| 2 | C | -0.790669 | NaN |
| 3 | D | -0.714890 | -0.184677 |
| 4 | D | -0.714890 | 2.720273 |
| 5 | E | NaN | -0.180246 |
UNION
UNION ALL can be performed using concat().
In [30]: df1 = pd.DataFrame({'city': ['Chicago', 'San Francisco', 'New York City'],
....: 'rank': range(1, 4)})
....:
In [31]: df2 = pd.DataFrame({'city': ['Chicago', 'Boston', 'Los Angeles'],
....: 'rank': [1, 4, 5]})
....: SELECT city, rank
FROM df1
UNION ALL
SELECT city, rank
FROM df2;
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Chicago 1
Boston 4
Los Angeles 5
*/pd.concat([df1,df2])| city | rank | |
|---|---|---|
| 0 | Chicago | 1 |
| 1 | San Francisco | 2 |
| 2 | New York City | 3 |
| 0 | Chicago | 1 |
| 1 | Boston | 4 |
| 2 | Los Angeles | 5 |
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
pd.concat([df1,df2]).drop_duplicates()| city | rank | |
|---|---|---|
| 0 | Chicago | 1 |
| 1 | San Francisco | 2 |
| 2 | New York City | 3 |
| 1 | Boston | 4 |
| 2 | Los Angeles | 5 |
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;tips.nlargest(10+5,columns='tip').tail(10)| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 183 | 23.17 | 6.50 | Male | Yes | Sun | Dinner | 4 |
| 214 | 28.17 | 6.50 | Female | Yes | Sat | Dinner | 3 |
| 47 | 32.40 | 6.00 | Male | No | Sun | Dinner | 4 |
| 239 | 29.03 | 5.92 | Male | No | Sat | Dinner | 3 |
| 88 | 24.71 | 5.85 | Male | No | Thur | Lunch | 2 |
| 181 | 23.33 | 5.65 | Male | Yes | Sun | Dinner | 2 |
| 44 | 30.40 | 5.60 | Male | No | Sun | Dinner | 4 |
| 52 | 34.81 | 5.20 | Female | No | Sun | Dinner | 4 |
| 85 | 34.83 | 5.17 | Female | No | Thur | Lunch | 4 |
| 211 | 25.89 | 5.16 | Male | Yes | Sat | Dinner | 4 |
-- Oracle's ROW_NUMBER() analytic function
SELECT * FROM (
SELECT
t.*,
ROW_NUMBER() OVER(PARTITION BY day ORDER BY total_bill DESC) AS rn
FROM tips t
)
WHERE rn < 3
ORDER BY day, rn;方法一
tips.assign(rn=tips.sort_values(['total_bill'],ascending=False).groupby(['day']).cumcount()+1).query('rn<3').sort_values(['day','rn'])| total_bill | tip | sex | smoker | day | time | size | rn | |
|---|---|---|---|---|---|---|---|---|
| 95 | 40.17 | 4.73 | Male | Yes | Fri | Dinner | 4 | 1 |
| 90 | 28.97 | 3.00 | Male | Yes | Fri | Dinner | 2 | 2 |
| 170 | 50.81 | 10.00 | Male | Yes | Sat | Dinner | 3 | 1 |
| 212 | 48.33 | 9.00 | Male | No | Sat | Dinner | 4 | 2 |
| 156 | 48.17 | 5.00 | Male | No | Sun | Dinner | 6 | 1 |
| 182 | 45.35 | 3.50 | Male | Yes | Sun | Dinner | 3 | 2 |
| 197 | 43.11 | 5.00 | Female | Yes | Thur | Lunch | 4 | 1 |
| 142 | 41.19 | 5.00 | Male | No | Thur | Lunch | 5 | 2 |
tips.assign(rnk=tips.groupby(['day'])['total_bill'].rank(method='first', ascending=False)).query('rnk < 3').sort_values(['day','rnk'])| total_bill | tip | sex | smoker | day | time | size | rnk | |
|---|---|---|---|---|---|---|---|---|
| 95 | 40.17 | 4.73 | Male | Yes | Fri | Dinner | 4 | 1.0 |
| 90 | 28.97 | 3.00 | Male | Yes | Fri | Dinner | 2 | 2.0 |
| 170 | 50.81 | 10.00 | Male | Yes | Sat | Dinner | 3 | 1.0 |
| 212 | 48.33 | 9.00 | Male | No | Sat | Dinner | 4 | 2.0 |
| 156 | 48.17 | 5.00 | Male | No | Sun | Dinner | 6 | 1.0 |
| 182 | 45.35 | 3.50 | Male | Yes | Sun | Dinner | 3 | 2.0 |
| 197 | 43.11 | 5.00 | Female | Yes | Thur | Lunch | 4 | 1.0 |
| 142 | 41.19 | 5.00 | Male | No | Thur | Lunch | 5 | 2.0 |
-- Oracle's RANK() analytic function
SELECT * FROM (
SELECT
t.*,
RANK() OVER(PARTITION BY sex ORDER BY tip) AS rnk
FROM tips t
WHERE tip < 2
)
WHERE rnk < 3
ORDER BY sex, rnk;tips[tips['tip'] < 2].assign(rnk_min=tips.groupby(['sex'])['tip'].rank(method='min')).query('rnk_min < 3').sort_values(['sex','rnk_min'])| total_bill | tip | sex | smoker | day | time | size | rnk_min | |
|---|---|---|---|---|---|---|---|---|
| 67 | 3.07 | 1.00 | Female | Yes | Sat | Dinner | 1 | 1.0 |
| 92 | 5.75 | 1.00 | Female | Yes | Fri | Dinner | 2 | 1.0 |
| 111 | 7.25 | 1.00 | Female | No | Sat | Dinner | 1 | 1.0 |
| 236 | 12.60 | 1.00 | Male | Yes | Sat | Dinner | 2 | 1.0 |
| 237 | 32.83 | 1.17 | Male | Yes | Sat | Dinner | 2 | 2.0 |
UPDATE
UPDATE tips
SET tip = tip*2
WHERE tip < 2;tips.loc[tips['tip'] < 2, 'tip'] *= 2--DELETE
DELETE FROM tips
WHERE tip > 9; tips = tips.loc[tips['tip'] <= 9]开窗函数:
array = [
['a', 2, 29],
['a', 1, 18],
['a', 0, 18],
['b', 2, 25],
['b', 4, 12],
['c', 6, 21],
]
data = pd.DataFrame(array, columns=["name", "class", "score"])| name | class | score | |
|---|---|---|---|
| 0 | a | 2 | 29 |
| 1 | a | 1 | 18 |
| 2 | a | 0 | 18 |
| 3 | b | 2 | 25 |
| 4 | b | 4 | 12 |
| 5 | c | 6 | 21 |
# 去重计数:nunique
data['nunique'] = data.groupby('name')['name'].transform('nunique')
# # 求和:sum
data['sum'] = data.groupby('name')['score'].transform('sum')
# # 均值:mean
data['mean'] = data.groupby('name')['score'].transform('mean')
# row_number
data['row_number'] = data.groupby('name')['class'].rank(ascending=True,method='first') 结果:
| name | class | score | nunique | sum | mean | row_number | |
|---|---|---|---|---|---|---|---|
| 0 | a | 2 | 29 | 1 | 65 | 21.666667 | 3.0 |
| 1 | a | 1 | 18 | 1 | 65 | 21.666667 | 2.0 |
| 2 | a | 0 | 18 | 1 | 65 | 21.666667 | 1.0 |
| 3 | b | 2 | 25 | 1 | 37 | 18.500000 | 1.0 |
| 4 | b | 4 | 12 | 1 | 37 | 18.500000 | 2.0 |
| 5 | c | 6 | 21 | 1 | 21 | 21.000000 | 1.0 |
版权声明:本文为baidu_20183817原创文章,遵循CC 4.0 BY-SA版权协议,转载请附上原文出处链接和本声明。