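Python: Project - Movie Dataset Analysis Code (39)
Exploring the Movie Dataset
In this project, you will apply what you have learned and use functions from the NumPy, Pandas, matplotlib, and seaborn libraries to explore a movie dataset.
Download the dataset:
TMDb movie data
Meaning of each column in the dataset: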
| Column name | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | keywords | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Meaning | ID | IMDB ID | Popularity | Budget | Box office revenue | Title | Cast | Website | Director | Tagline | Keywords | Overview | Runtime | Genres | Production companies | Release date | Vote count | Vote average | Release year | Budget (adjusted) | Box office revenue (adjusted) |
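Note that you need to submit the .html, .ipynb, and .py files exported from this report.
Section 1: Importing and Processing the Data
In this part, you need to write code that reads the data with Pandas and preprocesses it.
Task 1.1: Import the libraries and the data
- Load the required libraries: NumPy, Pandas, matplotlib, and seaborn.
- Use Pandas to read the data in tmdb-movies.csv and save it as movie_data.
Hint: remember to use the notebook magic command %matplotlib inline; otherwise the figures in the following steps will not be displayed.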
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
# Notebook magic command; without it, the figures below will not be rendered inline
%matplotlib inline
# Read the data
movie_data = pd.read_csv('./data/tmdb-movies.csv')
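Task 1.2: Get to know the data
You will encounter all kinds of data tables, so after loading one it is worth using a few simple methods to see what it looks like.
- Get the number of rows and columns of the table and print them.
- Use the .head(), .tail(), and .sample() methods to inspect the table.
- Use the .dtypes attribute to check the data type of each column.
- Use isnull() together with .any() to check whether each column contains null values.
- Use the .describe() method to see how the numeric columns are distributed.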
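The first item in the list above (printing the number of rows and columns) has no matching cell in the notebook output below. A minimal sketch of that step, run here on a tiny made-up table (`demo` and its values are hypothetical stand-ins for movie_data), might look like this:

```python
import pandas as pd

# Tiny stand-in table (hypothetical values) so the snippet runs on its own
demo = pd.DataFrame({
    "id": [1, 2, 3],
    "popularity": [0.5, 1.2, 3.4],
    "director": ["A", None, "C"],
})

# .shape is a (rows, columns) tuple
n_rows, n_cols = demo.shape
print(f"rows: {n_rows}, columns: {n_cols}")

# Number of missing values per column, a finer-grained view than .any()
null_counts = demo.isnull().sum()
print(null_counts)
```

On the real table, replacing `demo` with movie_data should give the same kind of summary; `isnull().sum()` is also a convenient way to judge how severe the gaps in each column are before cleaning.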
# Use .head()
movie_data.head(5)
| | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
| 3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | ... | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 |
| 4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | ... | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 |
5 rows × 21 columns
# print tail
movie_data.tail(5)
| | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 10861 | 21 | tt0060371 | 0.080598 | 0 | 0 | The Endless Summer | Michael Hynson|Robert August|Lord 'Tally Ho' B... | NaN | Bruce Brown | NaN | ... | The Endless Summer, by Bruce Brown, is one of ... | 95 | Documentary | Bruce Brown Films | 6/15/66 | 11 | 7.4 | 1966 | 0.000000 | 0.0 |
| 10862 | 20379 | tt0060472 | 0.065543 | 0 | 0 | Grand Prix | James Garner|Eva Marie Saint|Yves Montand|Tosh... | NaN | John Frankenheimer | Cinerama sweeps YOU into a drama of speed and ... | ... | Grand Prix driver Pete Aron is fired by his te... | 176 | Action|Adventure|Drama | Cherokee Productions|Joel Productions|Douglas ... | 12/21/66 | 20 | 5.7 | 1966 | 0.000000 | 0.0 |
| 10863 | 39768 | tt0060161 | 0.065141 | 0 | 0 | Beregis Avtomobilya | Innokentiy Smoktunovskiy|Oleg Efremov|Georgi Z... | NaN | Eldar Ryazanov | NaN | ... | An insurance agent who moonlights as a carthie... | 94 | Mystery|Comedy | Mosfilm | 1/1/66 | 11 | 6.5 | 1966 | 0.000000 | 0.0 |
| 10864 | 21449 | tt0061177 | 0.064317 | 0 | 0 | What's Up, Tiger Lily? | Tatsuya Mihashi|Akiko Wakabayashi|Mie Hama|Joh... | NaN | Woody Allen | WOODY ALLEN STRIKES BACK! | ... | In comic Woody Allen's film debut, he took the... | 80 | Action|Comedy | Benedict Pictures Corp. | 11/2/66 | 22 | 5.4 | 1966 | 0.000000 | 0.0 |
| 10865 | 22293 | tt0060666 | 0.035919 | 19000 | 0 | Manos: The Hands of Fate | Harold P. Warren|Tom Neyman|John Reynolds|Dian... | NaN | Harold P. Warren | It's Shocking! It's Beyond Your Imagination! | ... | A family gets lost on the road and stumbles up... | 74 | Horror | Norm-Iris | 11/15/66 | 15 | 1.5 | 1966 | 127642.279154 | 0.0 |
5 rows × 21 columns
# print('sample:')
movie_data.sample()
| | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 5477 | 112205 | tt2404311 | 1.483329 | 30000000 | 36894225 | The Family | Robert De Niro|Michelle Pfeiffer|Dianna Agron|... | NaN | Luc Besson | Some call it organized crime. Others call it f... | ... | The Manzoni family, a notorious mafia clan, is... | 111 | Crime|Comedy|Action | Canal Plus|TF1 Films Production|Grive Producti... | 9/13/13 | 710 | 6.1 | 2013 | 2.808100e+07 | 3.453423e+07 |
1 rows × 21 columns
# 3. Use the .dtypes attribute to check the data type of each column
movie_data.dtypes
id int64
imdb_id object
popularity float64
budget int64
revenue int64
original_title object
cast object
homepage object
director object
tagline object
keywords object
overview object
runtime int64
genres object
production_companies object
release_date object
vote_count int64
vote_average float64
release_year int64
budget_adj float64
revenue_adj float64
dtype: object
# 4. Use isnull() together with .any() to check whether each column contains null values.
movie_data.isnull().any()
id False
imdb_id True
popularity False
budget False
revenue False
original_title False
cast True
homepage True
director True
tagline True
keywords True
overview True
runtime False
genres True
production_companies True
release_date False
vote_count False
vote_average False
release_year False
budget_adj False
revenue_adj False
dtype: bool
# 5. Use the .describe() method to see how the numeric columns are distributed.
movie_data.describe()
| | id | popularity | budget | revenue | runtime | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 | 10866.000000 | 10866.000000 | 10866.000000 | 10866.000000 | 1.086600e+04 | 1.086600e+04 |
| mean | 66064.177434 | 0.646441 | 1.462570e+07 | 3.982332e+07 | 102.070863 | 217.389748 | 5.974922 | 2001.322658 | 1.755104e+07 | 5.136436e+07 |
| std | 92130.136561 | 1.000185 | 3.091321e+07 | 1.170035e+08 | 31.381405 | 575.619058 | 0.935142 | 12.812941 | 3.430616e+07 | 1.446325e+08 |
| min | 5.000000 | 0.000065 | 0.000000e+00 | 0.000000e+00 | 0.000000 | 10.000000 | 1.500000 | 1960.000000 | 0.000000e+00 | 0.000000e+00 |
| 25% | 10596.250000 | 0.207583 | 0.000000e+00 | 0.000000e+00 | 90.000000 | 17.000000 | 5.400000 | 1995.000000 | 0.000000e+00 | 0.000000e+00 |
| 50% | 20669.000000 | 0.383856 | 0.000000e+00 | 0.000000e+00 | 99.000000 | 38.000000 | 6.000000 | 2006.000000 | 0.000000e+00 | 0.000000e+00 |
| 75% | 75610.000000 | 0.713817 | 1.500000e+07 | 2.400000e+07 | 111.000000 | 145.750000 | 6.600000 | 2011.000000 | 2.085325e+07 | 3.369710e+07 |
| max | 417859.000000 | 32.985763 | 4.250000e+08 | 2.781506e+09 | 900.000000 | 9767.000000 | 9.200000 | 2015.000000 | 4.250000e+08 | 2.827124e+09 |
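Task 1.3: Clean the data
In real-world work, data processing is often the most time-consuming and laborious stage. Fortunately, the tmdb dataset provided here is quite "clean" and does not need much cleaning or processing. In this step, your main job is to handle the null values in the table. You can use .fillna() to fill them in, or use .dropna() to discard the rows or columns that contain null values.
Task: use an appropriate method to clean up the null values and save the resulting data.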
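To make the two options concrete, here is a small sketch on a made-up table (`demo`, its columns, and the "unknown" placeholder are illustrative assumptions, not part of the dataset). It contrasts filling missing values with dropping them:

```python
import pandas as pd

# Hypothetical mini-table with gaps in two text columns
demo = pd.DataFrame({
    "original_title": ["A", "B", "C"],
    "tagline": [None, "hello", None],
    "director": ["X", None, "Z"],
})

# Option 1: fill missing values in one column with a placeholder
# (the placeholder string is an arbitrary choice)
filled = demo.fillna({"tagline": "unknown"})

# Option 2: drop a sparse column, then drop rows that still contain NaN
cleaned = demo.drop(columns=["tagline"]).dropna(axis=0)

print(filled)
print(cleaned)
```

Filling keeps every row but introduces artificial values; dropping keeps only observed data at the cost of fewer rows. Which trade-off is acceptable depends on how the column is used later in the analysis.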
#movie_data.info()
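# movie_data.isnull().any() above shows which columns contain NaN; columns that are not needed can be dropped
# First drop the columns that have many missing values and are not needed for the analysis: 'homepage', 'tagline', 'keywords', 'production_companies'
# Then drop the rows with missing values in the lightly affected columns: 'imdb_id', 'cast', 'director', 'overview', 'genres'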
movie_data = movie_data.drop(columns = ['homepage','tagline','keywords','production_companies'])
movie_data.dropna(axis = 0, inplace = True)
# movie_data.isnull().any()
movie_data.head(5)
| | id | imdb_id | popularity | budget | revenue | original_title | cast | director | overview | runtime | genres | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | George Miller | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | Robert Schwentke | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
| 3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | J.J. Abrams | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 |
| 4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | James Wan | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 |
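Section 2: Reading Data According to Given Requirements
Compared with data analysis software such as Excel, one of Pandas' major strengths is that it can easily select the right data based on complex logic. Knowing how to pull the right data out of a table according to given requirements is therefore a very important Pandas skill, and it is the focus of this section.
Task 2.1: Simple selection
- Read the columns named id, popularity, budget, runtime, and vote_average.
- Read rows 1-20 as well as rows 48 and 49 of the table.
- Read the popularity column for rows 50-60 of the table.
Requirement: each statement must be implemented in a single line of code.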
# Read the columns id, popularity, budget, runtime, and vote_average; print the first 5 rows as a check
movie_data[['id','popularity','budget','runtime','vote_average']].head(5)
| | id | popularity | budget | runtime | vote_average |
|---|---|---|---|---|---|
| 0 | 135397 | 32.985763 | 150000000 | 124 | 6.5 |
| 1 | 76341 | 28.419936 | 150000000 | 120 | 7.1 |
| 2 | 262500 | 13.112507 | 110000000 | 119 | 6.3 |
| 3 | 140607 | 11.173104 | 200000000 | 136 | 7.5 |
| 4 | 168259 | 9.335014 | 190000000 | 137 | 7.3 |
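# Read rows 1-20 plus rows 48 and 49 by concatenating the two slices
# (DataFrame.append was removed in pandas 2.0, so pd.concat is used instead)
pd.concat([movie_data[0:20], movie_data[47:49]])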
| | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
| 3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | ... | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 |
| 4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | ... | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 |
| 5 | 281957 | tt1663202 | 9.110700 | 135000000 | 532950503 | The Revenant | Leonardo DiCaprio|Tom Hardy|Will Poulter|Domhn... | http://www.foxmovies.com/movies/the-revenant | Alejandro González Iñárritu | (n. One who has returned, as if from the dead.) | ... | In the 1820s, a frontiersman, Hugh Glass, sets... | 156 | Western|Drama|Adventure|Thriller | Regency Enterprises|Appian Way|CatchPlay|Anony... | 12/25/15 | 3929 | 7.2 | 2015 | 1.241999e+08 | 4.903142e+08 |
| 6 | 87101 | tt1340138 | 8.654359 | 155000000 | 440603537 | Terminator Genisys | Arnold Schwarzenegger|Jason Clarke|Emilia Clar... | http://www.terminatormovie.com/ | Alan Taylor | Reset the future | ... | The year is 2029. John Connor, leader of the r... | 125 | Science Fiction|Action|Thriller|Adventure | Paramount Pictures|Skydance Productions | 6/23/15 | 2598 | 5.8 | 2015 | 1.425999e+08 | 4.053551e+08 |
| 7 | 286217 | tt3659388 | 7.667400 | 108000000 | 595380321 | The Martian | Matt Damon|Jessica Chastain|Kristen Wiig|Jeff ... | http://www.foxmovies.com/movies/the-martian | Ridley Scott | Bring Him Home | ... | During a manned mission to Mars, Astronaut Mar... | 141 | Drama|Adventure|Science Fiction | Twentieth Century Fox Film Corporation|Scott F... | 9/30/15 | 4572 | 7.6 | 2015 | 9.935996e+07 | 5.477497e+08 |
| 8 | 211672 | tt2293640 | 7.404165 | 74000000 | 1156730962 | Minions | Sandra Bullock|Jon Hamm|Michael Keaton|Allison... | http://www.minionsmovie.com/ | Kyle Balda|Pierre Coffin | Before Gru, they had a history of bad bosses | ... | Minions Stuart, Kevin and Bob are recruited by... | 91 | Family|Animation|Adventure|Comedy | Universal Pictures|Illumination Entertainment | 6/17/15 | 2893 | 6.5 | 2015 | 6.807997e+07 | 1.064192e+09 |
| 9 | 150540 | tt2096673 | 6.326804 | 175000000 | 853708609 | Inside Out | Amy Poehler|Phyllis Smith|Richard Kind|Bill Ha... | http://movies.disney.com/inside-out | Pete Docter | Meet the little voices inside your head. | ... | Growing up can be a bumpy road, and it's no ex... | 94 | Comedy|Animation|Family | Walt Disney Pictures|Pixar Animation Studios|W... | 6/9/15 | 3935 | 8.0 | 2015 | 1.609999e+08 | 7.854116e+08 |
| 10 | 206647 | tt2379713 | 6.200282 | 245000000 | 880674609 | Spectre | Daniel Craig|Christoph Waltz|Léa Seydoux|Ralp... | http://www.sonypictures.com/movies/spectre/ | Sam Mendes | A Plan No One Escapes | ... | A cryptic message from Bond’s past sends him... | 148 | Action|Adventure|Crime | Columbia Pictures|Danjaq|B24 | 10/26/15 | 3254 | 6.2 | 2015 | 2.253999e+08 | 8.102203e+08 |
| 11 | 76757 | tt1617661 | 6.189369 | 176000003 | 183987723 | Jupiter Ascending | Mila Kunis|Channing Tatum|Sean Bean|Eddie Redm... | http://www.jupiterascending.com | Lana Wachowski|Lilly Wachowski | Expand your universe. | ... | In a universe where human genetic material is ... | 124 | Science Fiction|Fantasy|Action|Adventure | Village Roadshow Pictures|Dune Entertainment|A... | 2/4/15 | 1937 | 5.2 | 2015 | 1.619199e+08 | 1.692686e+08 |
| 12 | 264660 | tt0470752 | 6.118847 | 15000000 | 36869414 | Ex Machina | Domhnall Gleeson|Alicia Vikander|Oscar Isaac|S... | http://exmachina-movie.com/ | Alex Garland | There is nothing more human than the will to s... | ... | Caleb, a 26 year old coder at the world's larg... | 108 | Drama|Science Fiction | DNA Films|Universal Pictures International (UP... | 1/21/15 | 2854 | 7.6 | 2015 | 1.379999e+07 | 3.391985e+07 |
| 13 | 257344 | tt2120120 | 5.984995 | 88000000 | 243637091 | Pixels | Adam Sandler|Michelle Monaghan|Peter Dinklage|... | http://www.pixels-movie.com/ | Chris Columbus | Game On. | ... | Video game experts are recruited by the milita... | 105 | Action|Comedy|Science Fiction | Columbia Pictures|Happy Madison Productions | 7/16/15 | 1575 | 5.8 | 2015 | 8.095996e+07 | 2.241460e+08 |
| 14 | 99861 | tt2395427 | 5.944927 | 280000000 | 1405035767 | Avengers: Age of Ultron | Robert Downey Jr.|Chris Hemsworth|Mark Ruffalo... | http://marvel.com/movies/movie/193/avengers_ag... | Joss Whedon | A New Age Has Come. | ... | When Tony Stark tries to jumpstart a dormant p... | 141 | Action|Adventure|Science Fiction | Marvel Studios|Prime Focus|Revolution Sun Studios | 4/22/15 | 4304 | 7.4 | 2015 | 2.575999e+08 | 1.292632e+09 |
| 15 | 273248 | tt3460252 | 5.898400 | 44000000 | 155760117 | The Hateful Eight | Samuel L. Jackson|Kurt Russell|Jennifer Jason ... | http://thehatefuleight.com/ | Quentin Tarantino | No one comes up here without a damn good reason. | ... | Bounty hunters seek shelter from a raging bliz... | 167 | Crime|Drama|Mystery|Western | Double Feature Films|The Weinstein Company|Fil... | 12/25/15 | 2389 | 7.4 | 2015 | 4.047998e+07 | 1.432992e+08 |
| 16 | 260346 | tt2446042 | 5.749758 | 48000000 | 325771424 | Taken 3 | Liam Neeson|Forest Whitaker|Maggie Grace|Famke... | http://www.taken3movie.com/ | Olivier Megaton | It Ends Here | ... | Ex-government operative Bryan Mills finds his ... | 109 | Crime|Action|Thriller | Twentieth Century Fox Film Corporation|M6 Film... | 1/1/15 | 1578 | 6.1 | 2015 | 4.415998e+07 | 2.997096e+08 |
| 17 | 102899 | tt0478970 | 5.573184 | 130000000 | 518602163 | Ant-Man | Paul Rudd|Michael Douglas|Evangeline Lilly|Cor... | http://marvel.com/movies/movie/180/ant-man | Peyton Reed | Heroes Don't Get Any Bigger | ... | Armed with the astonishing ability to shrink i... | 115 | Science Fiction|Action|Adventure | Marvel Studios | 7/14/15 | 3779 | 7.0 | 2015 | 1.195999e+08 | 4.771138e+08 |
| 18 | 150689 | tt1661199 | 5.556818 | 95000000 | 542351353 | Cinderella | Lily James|Cate Blanchett|Richard Madden|Helen... | 0 | Kenneth Branagh | Midnight is just the beginning. | ... | When her father unexpectedly passes away, youn... | 112 | Romance|Fantasy|Family|Drama | Walt Disney Pictures|Genre Films|Beagle Pug Fi... | 3/12/15 | 1495 | 6.8 | 2015 | 8.739996e+07 | 4.989630e+08 |
| 19 | 131634 | tt1951266 | 5.476958 | 160000000 | 650523427 | The Hunger Games: Mockingjay - Part 2 | Jennifer Lawrence|Josh Hutcherson|Liam Hemswor... | http://www.thehungergames.movie/ | Francis Lawrence | The fire will burn forever. | ... | With the nation of Panem in a full scale war, ... | 136 | War|Adventure|Science Fiction | Studio Babelsberg|StudioCanal|Lionsgate|Walt D... | 11/18/15 | 2380 | 6.5 | 2015 | 1.471999e+08 | 5.984813e+08 |
| 47 | 286565 | tt3622592 | 2.968254 | 12000000 | 85512300 | Paper Towns | Nat Wolff|Cara Delevingne|Halston Sage|Justice... | 0 | Jake Schreier | Get Lost. Get Found. | ... | Quentin Jacobsen has spent a lifetime loving t... | 109 | Drama|Mystery|Romance | Fox 2000 Pictures | 7/9/15 | 1252 | 6.2 | 2015 | 1.104000e+07 | 7.867128e+07 |
| 48 | 265208 | tt2231253 | 2.932340 | 30000000 | 0 | Wild Card | Jason Statham|Michael Angarano|Milo Ventimigli... | 0 | Simon West | Never bet against a man with a killer hand. | ... | When a Las Vegas bodyguard with lethal skills ... | 92 | Thriller|Crime|Drama | Current Entertainment|Lionsgate|Sierra / Affin... | 1/14/15 | 481 | 5.3 | 2015 | 2.759999e+07 | 0.000000e+00 |
22 rows × 21 columns
# Read the popularity column for rows 50-60; the slice end is exclusive, so rows m..n map to the slice [m-1:n]
# Rows 50-60 therefore correspond to the slice 49:60
movie_data[49:60][['popularity']]
| | popularity |
|---|---|
| 49 | 2.885126 |
| 50 | 2.883233 |
| 51 | 2.814802 |
| 52 | 2.798017 |
| 53 | 2.793297 |
| 54 | 2.614499 |
| 55 | 2.584264 |
| 56 | 2.578919 |
| 57 | 2.575711 |
| 58 | 2.557859 |
| 59 | 2.550747 |
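Plain slicing such as movie_data[49:60] selects by position. Once rows have been dropped, index labels and positions no longer coincide, so it can help to make the positional intent explicit with .iloc. The following is a sketch on a made-up table (`demo` is a hypothetical stand-in for the cleaned movie_data):

```python
import pandas as pd

# Labels 0..99 with two rows removed, mimicking a table after dropna()
demo = pd.DataFrame({"popularity": [float(i) for i in range(100)]}).drop(index=[3, 10])

# Rows 1-20 plus rows 48 and 49, selected by position in one statement
picked = demo.iloc[list(range(20)) + [47, 48]]

# popularity for rows 50-60 by position (the slice end is exclusive)
pop_50_60 = demo.iloc[49:60]["popularity"]

print(picked.index.tolist())
print(pop_50_60.index.tolist())
```

The printed index labels differ from the positions because of the removed rows, which is exactly the situation in which .iloc and .loc would return different rows.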
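Task 2.2: Logical indexing
- Read all rows whose popularity is greater than 5.
- Read all rows whose popularity is greater than 5 and whose release year is after 1996.
Hint: in Pandas, the logical operators & and | mean "and" and "or", respectively.
Requirement: use logical indexing.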
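One detail that often trips people up with & and | is operator precedence: both bind more tightly than comparisons, so each condition needs its own parentheses. A small sketch on made-up data (`demo` and its values are hypothetical) illustrates both operators:

```python
import pandas as pd

# Hypothetical mini-table
demo = pd.DataFrame({
    "original_title": ["A", "B", "C", "D"],
    "popularity": [6.1, 2.0, 7.5, 9.0],
    "release_year": [1990, 2001, 2010, 1996],
})

# Parenthesize each comparison; & and | bind tighter than > and >=
popular_and_recent = demo[(demo["popularity"] > 5) & (demo["release_year"] > 1996)]
popular_or_recent = demo[(demo["popularity"] > 5) | (demo["release_year"] > 1996)]

print(popular_and_recent)
print(popular_or_recent)
```

This sketch uses a strict `>` so that 1996 itself is excluded; whether "after 1996" should include 1996 is a matter of interpretation, and `>=` would include it.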
# Read all rows whose popularity is greater than 5; print the first 5
movie_data[movie_data.popularity > 5].head(5)
| | id | imdb_id | popularity | budget | revenue | original_title | cast | director | overview | runtime | genres | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | Colin Trevorrow | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | George Miller | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | Robert Schwentke | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
| 3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | J.J. Abrams | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 |
| 4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | James Wan | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 |
# Read all rows whose popularity is greater than 5 and whose release year is 1996 or later (the condition uses >=, so 1996 itself is included)
movie_data[(movie_data.popularity > 5) & (movie_data.release_year >= 1996)].head(5)
| | id | imdb_id | popularity | budget | revenue | original_title | cast | homepage | director | tagline | ... | overview | runtime | genres | production_companies | release_date | vote_count | vote_average | release_year | budget_adj | revenue_adj |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 135397 | tt0369610 | 32.985763 | 150000000 | 1513528810 | Jurassic World | Chris Pratt|Bryce Dallas Howard|Irrfan Khan|Vi... | http://www.jurassicworld.com/ | Colin Trevorrow | The park is open. | ... | Twenty-two years after the events of Jurassic ... | 124 | Action|Adventure|Science Fiction|Thriller | Universal Studios|Amblin Entertainment|Legenda... | 6/9/15 | 5562 | 6.5 | 2015 | 1.379999e+08 | 1.392446e+09 |
| 1 | 76341 | tt1392190 | 28.419936 | 150000000 | 378436354 | Mad Max: Fury Road | Tom Hardy|Charlize Theron|Hugh Keays-Byrne|Nic... | http://www.madmaxmovie.com/ | George Miller | What a Lovely Day. | ... | An apocalyptic story set in the furthest reach... | 120 | Action|Adventure|Science Fiction|Thriller | Village Roadshow Pictures|Kennedy Miller Produ... | 5/13/15 | 6185 | 7.1 | 2015 | 1.379999e+08 | 3.481613e+08 |
| 2 | 262500 | tt2908446 | 13.112507 | 110000000 | 295238201 | Insurgent | Shailene Woodley|Theo James|Kate Winslet|Ansel... | http://www.thedivergentseries.movie/#insurgent | Robert Schwentke | One Choice Can Destroy You | ... | Beatrice Prior must confront her inner demons ... | 119 | Adventure|Science Fiction|Thriller | Summit Entertainment|Mandeville Films|Red Wago... | 3/18/15 | 2480 | 6.3 | 2015 | 1.012000e+08 | 2.716190e+08 |
| 3 | 140607 | tt2488496 | 11.173104 | 200000000 | 2068178225 | Star Wars: The Force Awakens | Harrison Ford|Mark Hamill|Carrie Fisher|Adam D... | http://www.starwars.com/films/star-wars-episod... | J.J. Abrams | Every generation has a story. | ... | Thirty years after defeating the Galactic Empi... | 136 | Action|Adventure|Science Fiction|Fantasy | Lucasfilm|Truenorth Productions|Bad Robot | 12/15/15 | 5292 | 7.5 | 2015 | 1.839999e+08 | 1.902723e+09 |
| 4 | 168259 | tt2820852 | 9.335014 | 190000000 | 1506249360 | Furious 7 | Vin Diesel|Paul Walker|Jason Statham|Michelle ... | http://www.furious7.com/ | James Wan | Vengeance Hits Home | ... | Deckard Shaw seeks revenge against Dominic Tor... | 137 | Action|Crime|Thriller | Universal Pictures|Original Film|Media Rights ... | 4/1/15 | 2947 | 7.3 | 2015 | 1.747999e+08 | 1.385749e+09 |
5 rows × 21 columns
Task 2.3: Grouped selection
Requirement: use the groupby command.
# Compute the mean revenue for each release year
movie_data.groupby(['release_year'])['revenue'].mean().head(5)
release_year
1960 4.531406e+06
1961 1.089420e+07
1962 6.736870e+06
1963 5.511911e+06
1964 8.118614e+06
Name: revenue, dtype: float64
# Group by director, take the mean popularity (.mean() is used here; .agg('mean') is equivalent), sort from high to low with sort_values, and print only the top 10 directors
movie_data.groupby(['director'])['popularity'].mean().sort_values(ascending=False).head(10)
director
Colin Trevorrow 16.696886
Joe Russo|Anthony Russo 12.971027
Chad Stahelski|David Leitch 11.422751
Don Hall|Chris Williams 8.691294
Juno John Lee 8.411577
Kyle Balda|Pierre Coffin 7.404165
Alan Taylor 6.883129
Peter Richardson 6.668990
Pete Docter 6.326804
Christopher Nolan 6.195521
Name: popularity, dtype: float64
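The comment on the director ranking mentions .agg, while the statement itself calls .mean(). For a single statistic the two are interchangeable; .agg becomes useful when several statistics are wanted at once. A sketch on made-up data (`demo` and the output column names mean_popularity and n_films are hypothetical):

```python
import pandas as pd

# Hypothetical mini-table
demo = pd.DataFrame({
    "director": ["X", "X", "Y", "Z", "Z"],
    "popularity": [1.0, 3.0, 5.0, 2.0, 2.0],
})

# Named aggregation: mean popularity and number of films per director
summary = (
    demo.groupby("director")["popularity"]
    .agg(mean_popularity="mean", n_films="count")
    .sort_values("mean_popularity", ascending=False)
)

print(summary)
```

Keeping the film count next to the mean makes it easier to see when a high average rests on only one or two films, which is plausibly the case for several directors in the top-10 list above.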
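Section 3: Plotting and Visualization
Next, you will plot and visualize your data. The most important point in this section is to choose a suitable chart for a specific visualization goal. The goal is the information and the changes you hope to observe through the visualization, for example how box office revenue changes over time, or which director is the most popular.
| Visualization goal | Suitable charts |
|---|---|
| Show the distribution of a single attribute | Pie chart, histogram, scatter plot |
| Show how an attribute changes with another variable | Bar chart, line chart, heat map |
| Compare the relationships among several attributes | Scatter plot, violin plot, stacked bar chart, stacked line chart |
In this part, you need to choose an appropriate chart for each question, draw it, and analyze the result. The optional tasks are somewhat harder; feel free to take them on as a challenge.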
Task 3.1: Plot the popularity values of the 20 movies with the highest popularity.
# movie_data[['original_title','popularity']].sort_values(by='popularity', ascending=False)[:20]
# top_movies = movie_data.set_index('original_title')['popularity'].sort_values()[-20:]
top_movies = movie_data[['original_title','popularity']].sort_values(by='popularity', ascending=False)[:20].sort_values(by='popularity', ascending=True)
# Set the colors
# @see doc:
# https://stackoverflow.com/questions/18973404/setting-different-bar-color-in-matplotlib-python
my_colors = 'rgbkymc' #red, green, blue, black, etc.
# Use the original_title column as the index and draw a horizontal bar chart (barh)
top_movies.set_index('original_title').plot(kind='barh', color=my_colors)
plt.xlabel('Popularity')
plt.ylabel('Original Title')
plt.title('Top 20 Movies by Popularity');

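Task 3.2: Analyze how movie net profit (box office minus cost) changes over the years, and give a brief analysis.
# Add a new column: profit = box office - cost (using the adjusted columns)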
movie_data['profit'] = movie_data['revenue_adj'] - movie_data['budget_adj']
# movie_data.head(5)
# Group by year and compute the mean profit per year
movie_data.groupby(['release_year'])['profit'].mean().plot(kind='line', figsize=(16, 8))
plt.ylabel('profit_mean')
Text(0,0.5,'profit_mean')

# Group by year and compute the standard deviation of profit
movie_data.groupby(['release_year'])['profit'].std().plot(kind='line', figsize=(16, 8))
plt.ylabel('profit_std')
Text(0,0.5,'profit_std')

# Count the number of movies released per year
movie_data.groupby('release_year')['original_title'].count().plot(kind='line', figsize=(16, 8));
plt.ylabel('movie_count')
Text(0,0.5,'movie_count')

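# Analysis of movie net profit
# 1. Profit fluctuated strongly from the 1960s through the 1980s and became more stable over time
# 2. As the number of movies released each year gradually increased, the average net profit per movie declined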
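[Optional] Task 3.3: Select the 10 most prolific directors (those with the most films), plot the box office of each one's top 3 films, and give a brief analysis.
# 1. First select the 10 most prolific directors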
# new_movie_data = movie_data[['original_title', 'revenue', 'director']]
# new_movie_data.groupby(new_movie_data['director'])['original_title'].count().sort_values(ascending=False).head(10)
tmp = movie_data['director'].str.split('|', expand=True).stack().reset_index(level=1, drop=True).rename('director')
movie_data_split = movie_data[['original_title', 'revenue']].join(tmp)
movie_data_split.groupby(movie_data_split['director'])['original_title'].count().sort_values(ascending=False).head(10)
director
Woody Allen 46
Clint Eastwood 34
Martin Scorsese 31
Steven Spielberg 30
Ridley Scott 23
Steven Soderbergh 23
Ron Howard 22
Joel Schumacher 21
Tim Burton 20
Brian De Palma 20
Name: original_title, dtype: int64
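The split/stack/join sequence above turns a pipe-separated director string into one row per director. In pandas 0.25 and later, the same reshaping can be written with DataFrame.explode. The following is a sketch on made-up data (`demo` is hypothetical), offered as an alternative rather than as the notebook's own method:

```python
import pandas as pd

# Hypothetical mini-table with pipe-separated co-directors
demo = pd.DataFrame({
    "original_title": ["M1", "M2", "M3"],
    "director": ["A|B", "A", "C"],
})

# Split the string into a list, then emit one row per list element
per_director = demo.assign(director=demo["director"].str.split("|")).explode("director")

# Films per director, most prolific first
counts = (
    per_director.groupby("director")["original_title"]
    .count()
    .sort_values(ascending=False)
)

print(counts)
```

A co-directed film is counted once for each of its directors, which matches the behavior of the stack-based version.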
# 2. Get each director's top 3 films (the code below ranks them by vote_average)
directors = list(movie_data_split.groupby(movie_data_split['director'])['original_title'].count().sort_values(ascending=False).head(10).index)
f1, f2, f3, f4 = [],[],[],[]
for director in directors:
    # For each director's top 3 films, collect the title, box office, director, and rating
    a = list(movie_data[(movie_data['director']== director)]['vote_average'].sort_values(ascending=False)[0:3].rename(index = movie_data['original_title']).index)
    b = list(movie_data[(movie_data['director']== director)]['vote_average'].sort_values(ascending=False)[0:3].rename(index = movie_data['revenue_adj']).index)
    c = list(movie_data[(movie_data['director']== director)]['vote_average'].sort_values(ascending=False)[0:3].rename(index = movie_data['director']).index)
    d = list(movie_data[(movie_data['director']== director)]['vote_average'].sort_values(ascending=False)[0:3].rename(index = movie_data['vote_average']).index)
    f1 += a
    f2 += b
    f3 += c
    f4 += d
items = {'director': pd.Series(f3),'original_title': pd.Series(f1), 'revenue_adj': pd.Series(f2),'vote_average': pd.Series(f4)}
df = pd.DataFrame(items)
# Use the director as the index
df.set_index('director',inplace = True)
df
| director | original_title | revenue_adj | vote_average |
|---|---|---|---|
| Woody Allen | Manhattan | 1.200223e+08 | 7.7 |
| Woody Allen | Annie Hall | 1.376203e+08 | 7.6 |
| Woody Allen | Hannah and Her Sisters | 7.974345e+07 | 7.3 |
| Clint Eastwood | Million Dollar Baby | 2.502418e+08 | 7.6 |
| Clint Eastwood | Gran Torino | 2.734101e+08 | 7.6 |
| Clint Eastwood | Unforgiven | 2.473345e+08 | 7.5 |
| Martin Scorsese | The Last Waltz | 1.076189e+06 | 8.0 |
| Martin Scorsese | Goodfellas | 7.816519e+07 | 8.0 |
| Martin Scorsese | George Harrison: Living in the Material World | 0.000000e+00 | 8.0 |
| Steven Spielberg | Schindler's List | 4.849410e+08 | 8.1 |
| Steven Spielberg | Saving Private Ryan | 6.445564e+08 | 7.7 |
| Steven Spielberg | Catch Me If You Can | 4.268546e+08 | 7.6 |
| Ridley Scott | Blade Runner | 7.404548e+07 | 7.7 |
| Ridley Scott | Gladiator | 5.795065e+08 | 7.7 |
| Ridley Scott | The Martian | 5.477497e+08 | 7.6 |
| Steven Soderbergh | Ocean's Eleven | 5.550528e+08 | 7.0 |
| Steven Soderbergh | Erin Brockovich | 3.245143e+08 | 6.9 |
| Steven Soderbergh | The Limey | 4.179939e+06 | 6.6 |
| Ron Howard | Rush | 8.447479e+07 | 7.7 |
| Ron Howard | A Beautiful Mind | 3.861237e+08 | 7.5 |
| Ron Howard | Apollo 13 | 5.083337e+08 | 7.1 |
| Joel Schumacher | Falling Down | 6.174274e+07 | 7.0 |
| Joel Schumacher | A Time to Kill | 2.116828e+08 | 7.0 |
| Joel Schumacher | The Phantom of the Opera | 1.785337e+08 | 6.8 |
| Tim Burton | Vincent | 0.000000e+00 | 7.9 |
| Tim Burton | Edward Scissorhands | 8.845162e+07 | 7.4 |
| Tim Burton | Big Fish | 1.457024e+08 | 7.4 |
| Brian De Palma | Scarface | 1.442422e+08 | 7.8 |
| Brian De Palma | Phantom of the Paradise | 0.000000e+00 | 7.5 |
| Brian De Palma | The Untouchables | 1.463691e+08 | 7.5 |
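The table above lists each director's three highest-rated films, whereas the task statement asks about box office. If the ranking is meant to be by revenue instead, one possible sketch is to sort by revenue_adj and keep the first three rows per director. The data below is made up, and `top_directors` is a hypothetical stand-in for the list of prolific directors:

```python
import pandas as pd

# Hypothetical mini-table
demo = pd.DataFrame({
    "director": ["X"] * 4 + ["Y"] * 2,
    "original_title": ["x1", "x2", "x3", "x4", "y1", "y2"],
    "revenue_adj": [10.0, 40.0, 30.0, 20.0, 5.0, 7.0],
})

top_directors = ["X", "Y"]  # stand-in for the 10 most prolific directors

top3_by_revenue = (
    demo[demo["director"].isin(top_directors)]
    .sort_values("revenue_adj", ascending=False)  # highest revenue first
    .groupby("director")
    .head(3)                                      # keep at most 3 rows per director
    .sort_values(["director", "revenue_adj"], ascending=[True, False])
    .set_index("director")
)

print(top3_by_revenue)
```

A frame in this shape could then be passed to a bar plot (for example with the film title on one axis and revenue_adj on the other) to complete the plotting part of the task.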
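[Optional] Task 3.4: Analyze how the number of movies released in June changed from 1968 to 2015.
# Select the years 1968-2015 (between() includes both endpoints by default; the boolean form inclusive=True is no longer accepted by recent pandas versions)
new_data = movie_data
sel_year = new_data['release_year'].between(1968, 2015)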
# Select the movies released in June
sel_june = list(map(lambda x: (pd.to_datetime(x).month) == 6, new_data['release_date']))
new_data[sel_year&sel_june]['release_year'].value_counts().sort_index().plot(kind='line', figsize=(20, 10), lw = 3);
plt.xlabel('release_year', fontsize = 16);
plt.ylabel('movies_in_June_from_1968_to_2015', fontsize = 16);
plt.grid(True)

# Number of June releases, 1968-2015: the overall trend is upward with short-term dips, and the rise accelerates after 2000
[Optional] Task 3.5: Analyze how the numbers of Comedy and Drama movies released in June changed from 1968 to 2015.
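The notebook leaves this task without a solution. One way to approach it, sketched here on made-up rows, is to filter to June releases within the year range and then count, per year, the films whose genres string contains each genre. `demo` is hypothetical, and the month extraction assumes release_date is an m/d/yy string, as it appears in the .head() output earlier:

```python
import pandas as pd

# Hypothetical mini-table in the same shape as the relevant movie_data columns
demo = pd.DataFrame({
    "release_year": [1970, 1970, 1970, 2000, 2000, 1960],
    "release_date": ["6/1/70", "6/15/70", "7/4/70", "6/9/00", "6/20/00", "6/5/60"],
    "genres": ["Comedy|Drama", "Drama", "Comedy", "Comedy", "Action", "Drama"],
})

in_range = demo["release_year"].between(1968, 2015)
# The month is the first field of the m/d/yy string
in_june = demo["release_date"].str.split("/").str[0].astype(int) == 6
june = demo[in_range & in_june]

# One column per genre, one row per year; a film tagged with both genres counts in both
counts = pd.DataFrame({
    genre: june[june["genres"].str.contains(genre, na=False)].groupby("release_year").size()
    for genre in ["Comedy", "Drama"]
}).fillna(0).astype(int)

print(counts)
```

Calling `counts.plot(kind='line')` on the result should draw the two genres as separate lines over the years, which is one reasonable chart choice for comparing their trends.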
Free to share - non-commercial - no derivatives - attribution required (Creative Commons 3.0 license)