Application 1.1: Data management and visualization
Basic tidyverse grammar in R and Python
This application presents basic examples of data management and visualization in R and Python following the “tidyverse philosophy”, a way of working within each language that structures the raw data for later statistical analysis:
“At a high level, the tidyverse is a language for solving data science challenges […]. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of […] packages that share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next.” (Wickham et al., 2019)
The following web pages give details on tidyverse-style management (cleaning and preparation) of data sets and on plotting in R:
Tidy data: https://tidyr.tidyverse.org/articles/tidy-data.html
The tidyverse meta-package in R: several lessons on how the tidyverse works:
https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-1-getting-started/
https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-2-data-visualisation/
https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-4-when-to-purrr/
More generally, to learn how to do “data science” with R, see the book by Wickham, Çetinkaya-Rundel and Grolemund: https://r4ds.hadley.nz/ (a Spanish version of the book is available at https://es.r4ds.hadley.nz/).
About the data used in this application:
Gapminder (https://www.gapminder.org/fw/world-health-chart/): the country-level data were extracted from the World Bank database (https://data.worldbank.org/) using the WDI package (https://vincentarelbundock.github.io/WDI/index.html).
NYC_Flights_2013 (https://github.com/tidyverse/nycflights13): flights departing from New York to destinations in the United States, Puerto Rico and the Virgin Islands.
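The "tidy data" idea linked above (one row per observation, one column per variable) can be illustrated with a small reshaping sketch. The table below is hypothetical and its values are invented; only the pandas calls (melt, sort_values) are real API.

```python
import pandas as pd

# Hypothetical "wide" table: one column per year (values are made up).
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1980": [10.0, 20.0],
    "1981": [11.0, 21.0],
})

# Tidy form: one row per (country, year) observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="value")
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.sort_values(["country", "year"]).reset_index(drop=True)
print(tidy)
```

In the tidy layout, grouping, filtering and plotting by year or country need no further reshaping, which is the structure the GAPMINDER file used below already has.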
Code
# Load libraries
library(tidyverse)
tidyverse_packages()
[1] "broom" "conflicted" "cli" "dbplyr"
[5] "dplyr" "dtplyr" "forcats" "ggplot2"
[9] "googledrive" "googlesheets4" "haven" "hms"
[13] "httr" "jsonlite" "lubridate" "magrittr"
[17] "modelr" "pillar" "purrr" "ragg"
[21] "readr" "readxl" "reprex" "rlang"
[25] "rstudioapi" "rvest" "stringr" "tibble"
[29] "tidyr" "xml2" "tidyverse"
Code
library(reticulate)
Code
# Load data
# (https://es.r4ds.hadley.nz/10-tibble.html)
# (https://es.r4ds.hadley.nz/11-import.html)
gapminder <- read_csv("data/GAPMINDER.csv")
dim(gapminder)
[1] 13454 8
Code
class(gapminder)
[1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
Code
# Data handling and transformation
# (https://es.r4ds.hadley.nz/05-transform.html)
#
# Rename variables
gapminder <- gapminder %>%
  rename(year = date,
         gdpPercap = NY.GDP.PCAP.CD,
         lifeExp = SP.DYN.LE00.IN,
         pop = SP.POP.TOTL)
# First and last rows
head(gapminder)
# A tibble: 6 × 8
iso2c iso3c country year gdpPercap lifeExp pop region
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 AW ABW Aruba 1960 NA 64.2 54608 Latin America & Caribbean
2 AW ABW Aruba 1961 NA 64.5 55811 Latin America & Caribbean
3 AW ABW Aruba 1962 NA 64.8 56682 Latin America & Caribbean
4 AW ABW Aruba 1963 NA 65.1 57475 Latin America & Caribbean
5 AW ABW Aruba 1964 NA 65.3 58178 Latin America & Caribbean
6 AW ABW Aruba 1965 NA 65.5 58782 Latin America & Caribbean
Code
tail(gapminder)
# A tibble: 6 × 8
iso2c iso3c country year gdpPercap lifeExp pop region
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
1 ZW ZWE Zimbabwe 2016 1422. 60.3 14452704 Sub-Saharan Africa
2 ZW ZWE Zimbabwe 2017 1192. 60.7 14751101 Sub-Saharan Africa
3 ZW ZWE Zimbabwe 2018 2269. 61.4 15052184 Sub-Saharan Africa
4 ZW ZWE Zimbabwe 2019 1422. 61.3 15354608 Sub-Saharan Africa
5 ZW ZWE Zimbabwe 2020 1373. 61.1 15669666 Sub-Saharan Africa
6 ZW ZWE Zimbabwe 2021 1774. 59.3 15993524 Sub-Saharan Africa
Code
# dplyr package (https://dplyr.tidyverse.org/)
# and 'pipes' (https://es.r4ds.hadley.nz/18-pipes.html)
#
# dplyr "verbs"
# select
gapminder_selected <- select(gapminder, year, country, pop, gdpPercap)
# filter
gapminder_filtered <- filter(gapminder_selected, year >= 1980)
# mutate
gapminder_mutated <- mutate(gapminder_filtered, GDP = gdpPercap*pop)
# group_by
gapminder_grouped <- group_by(gapminder_mutated, country)
# summarise
gapminder_summarised <- summarise(gapminder_grouped,
                                  GDP_avg = mean(GDP, na.rm = TRUE))
# arrange
gapminder_arranged_ascending <- arrange(gapminder_summarised, GDP_avg)
gapminder_arranged_ascending
# A tibble: 217 × 2
country GDP_avg
<chr> <dbl>
1 Tuvalu 26885179.
2 Kiribati 99642313.
3 Nauru 102113631.
4 Marshall Islands 128774246.
5 Palau 214044364.
6 Tonga 247764480.
7 Micronesia, Fed. Sts. 251170068.
8 Sao Tome and Principe 256379527.
9 Dominica 335912659.
10 Vanuatu 416627524.
# ℹ 207 more rows
Code
gapminder_arranged_descending <- arrange(gapminder_summarised, -GDP_avg)
gapminder_arranged_descending
# A tibble: 217 × 2
country GDP_avg
<chr> <dbl>
1 United States 1.11e13
2 China 4.12e12
3 Japan 4.10e12
4 Germany 2.48e12
5 United Kingdom 1.82e12
6 France 1.79e12
7 Italy 1.44e12
8 Brazil 1.03e12
9 India 9.96e11
10 Canada 9.96e11
# ℹ 207 more rows
Code
# tidyverse pipe operator: chaining with '%>%'
# (R's native pipe: |> )
AVG_GDP <-
  gapminder %>%
  select(year, country, pop, gdpPercap) %>%
  filter(year>=1980) %>%
  mutate(GDP=gdpPercap*pop) %>%
  group_by(country) %>%
  summarise(GDP_avg=mean(GDP, na.rm = TRUE)) %>%
  arrange(-GDP_avg) %>%
  na.omit()
head(AVG_GDP,10)
# A tibble: 10 × 2
country GDP_avg
<chr> <dbl>
1 United States 1.11e13
2 China 4.12e12
3 Japan 4.10e12
4 Germany 2.48e12
5 United Kingdom 1.82e12
6 France 1.79e12
7 Italy 1.44e12
8 Brazil 1.03e12
9 India 9.96e11
10 Canada 9.96e11
Code
tail(AVG_GDP,10)
# A tibble: 10 × 2
country GDP_avg
<chr> <dbl>
1 Vanuatu 416627524.
2 Dominica 335912659.
3 Sao Tome and Principe 256379527.
4 Micronesia, Fed. Sts. 251170068.
5 Tonga 247764480.
6 Palau 214044364.
7 Marshall Islands 128774246.
8 Nauru 102113631.
9 Kiribati 99642313.
10 Tuvalu 26885179.
Code
COUNT_cntr <-
  gapminder %>%
  select(year, region, country) %>%
  filter(year>=1980) %>%
  group_by(region) %>%
  summarise(cntr_distinct = n_distinct(country))
COUNT_cntr
# A tibble: 7 × 2
region cntr_distinct
<chr> <int>
1 East Asia & Pacific 37
2 Europe & Central Asia 58
3 Latin America & Caribbean 42
4 Middle East & North Africa 21
5 North America 3
6 South Asia 8
7 Sub-Saharan Africa 48
Code
AVG_lifeExp <-
  gapminder %>%
  select(year, lifeExp) %>%
  filter(year>=1980) %>%
  group_by(year) %>%
  summarise(lifeExp_avg=mean(lifeExp, na.rm = TRUE))
AVG_lifeExp
# A tibble: 42 × 2
year lifeExp_avg
<dbl> <dbl>
1 1980 62.2
2 1981 62.6
3 1982 62.8
4 1983 63.1
5 1984 63.4
6 1985 63.7
7 1986 64.2
8 1987 64.4
9 1988 64.5
10 1989 65.0
# ℹ 32 more rows
Code
AVG_lifeExp_gdpPercap <-
  gapminder %>%
  select(year, region, lifeExp, gdpPercap) %>%
  filter(year>=1980) %>%
  group_by(year,region) %>%
  summarise(lifeExp_avg=mean(lifeExp, na.rm = TRUE),
            gdpPercap_avg=mean(gdpPercap, na.rm = TRUE))
AVG_lifeExp_gdpPercap
# A tibble: 294 × 4
# Groups: year [42]
year region lifeExp_avg gdpPercap_avg
<dbl> <chr> <dbl> <dbl>
1 1980 East Asia & Pacific 63.1 4381.
2 1980 Europe & Central Asia 70.0 12552.
3 1980 Latin America & Caribbean 65.9 2036.
4 1980 Middle East & North Africa 62.9 9305.
5 1980 North America 74.1 11667.
6 1980 South Asia 52.9 254.
7 1980 Sub-Saharan Africa 50.3 816.
8 1981 East Asia & Pacific 63.5 4014.
9 1981 Europe & Central Asia 70.2 11283.
10 1981 Latin America & Caribbean 66.3 2206.
# ℹ 284 more rows
Code
# Joining, combining and reshaping data files
# (https://es.r4ds.hadley.nz/13-relational-data.html)
# Family of join operations:
# inner_join(df1, df2), left_join(df1, df2), right_join(df1, df2)
# full_join(df1, df2), semi_join(df1, df2), anti_join(df1, df2)
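To make the join family listed above concrete, here is a minimal sketch in pandas terms on two toy tables. The frames `left` and `right` and their values are hypothetical. The semi and anti joins are approximated here with key-membership filters, which is one common way to express them and not the only one.

```python
import pandas as pd

# Two hypothetical tables sharing the column "key" (values are made up).
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

inner = left.merge(right, on="key", how="inner")    # keys present in both
left_j = left.merge(right, on="key", how="left")    # all keys of `left`
right_j = left.merge(right, on="key", how="right")  # all keys of `right`
full = left.merge(right, on="key", how="outer")     # keys of either table

# Semi join: rows of `left` that have a match in `right` (no columns added).
semi = left[left["key"].isin(right["key"])]
# Anti join: rows of `left` that have no match in `right`.
anti = left[~left["key"].isin(right["key"])]

for name, df in [("inner", inner), ("left", left_j), ("right", right_j),
                 ("full", full), ("semi", semi), ("anti", anti)]:
    print(name, sorted(df["key"]))
```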
Code
# Load data
library(nycflights13)
flights
# A tibble: 336,776 × 19
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
<int> <int> <int> <int> <int> <dbl> <int> <int>
1 2013 1 1 517 515 2 830 819
2 2013 1 1 533 529 4 850 830
3 2013 1 1 542 540 2 923 850
4 2013 1 1 544 545 -1 1004 1022
5 2013 1 1 554 600 -6 812 837
6 2013 1 1 554 558 -4 740 728
7 2013 1 1 555 600 -5 913 854
8 2013 1 1 557 600 -3 709 723
9 2013 1 1 557 600 -3 838 846
10 2013 1 1 558 600 -2 753 745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
Code
planes
# A tibble: 3,322 × 9
tailnum year type manufacturer model engines seats speed engine
<chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
# ℹ 3,312 more rows
Code
# Example of a left join: left_join
# The 'by =' argument should be given explicitly to avoid errors or
# wrong automatic key matching.
# Examples of other operations can be found at:
# https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
flights_planes <- left_join(flights, planes, by = "tailnum") %>%
  select(month, day, dep_time, arr_time,
         carrier, flight, tailnum, model)
flights_planes
# A tibble: 336,776 × 8
month day dep_time arr_time carrier flight tailnum model
<int> <int> <int> <int> <chr> <int> <chr> <chr>
1 1 1 517 830 UA 1545 N14228 737-824
2 1 1 533 850 UA 1714 N24211 737-824
3 1 1 542 923 AA 1141 N619AA 757-223
4 1 1 544 1004 B6 725 N804JB A320-232
5 1 1 554 812 DL 461 N668DN 757-232
6 1 1 554 740 UA 1696 N39463 737-924ER
7 1 1 555 913 B6 507 N516JB A320-232
8 1 1 557 709 EV 5708 N829AS CL-600-2B19
9 1 1 557 838 B6 79 N593JB A320-232
10 1 1 558 753 AA 301 N3ALAA <NA>
# ℹ 336,766 more rows
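The advice to name the join key explicitly can be illustrated with a small pandas sketch. The two toy tables below only mimic the shape of flights and planes (both have a `year` column with different meanings); their values are invented. When no key is given, pandas merges on every column the two tables share, which is the same kind of automatic matching the comment above warns about.

```python
import pandas as pd

# Hypothetical stand-ins for flights and planes (values are made up).
flights = pd.DataFrame({
    "year": [2013, 2013],        # year of the flight
    "tailnum": ["N1", "N2"],
    "flight": [100, 200],
})
planes = pd.DataFrame({
    "year": [2004, 2013],        # year of manufacture, a different variable
    "tailnum": ["N1", "N2"],
    "model": ["M-1", "M-2"],
})

# No key given: the merge uses all shared columns ("year" and "tailnum"),
# so plane N1 (built in 2004) no longer matches its 2013 flight.
implicit = flights.merge(planes, how="left")

# Explicit key: match on tail number only; the two "year" columns are kept
# apart with the default suffixes _x and _y.
explicit = flights.merge(planes, on="tailnum", how="left")

print(implicit)
print(explicit)
```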
Code
# Grammar of graphics (ggplot2) [https://ggplot2.tidyverse.org/]
# (https://es.r4ds.hadley.nz/03-visualize.html)
gapminder <- gapminder %>%
  filter(year>=1980) %>%
  select(year, region, country, pop, lifeExp, gdpPercap)
gapminder <- gapminder %>% na.omit()
gapminder <- arrange(gapminder, year)
# Plot 1: basic scatter plot
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.7)
Code
# Plot 2: with a nonparametric fit
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(alpha = 0.2) + geom_smooth(method = "loess")
Code
# Plot 3: colors by region
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(col = region), alpha = 0.3)
Code
# Plot 4: colors by region and size by population
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) +
geom_point(aes(size = pop, col = region), alpha = 0.3)
Code
# Plot 5: colors by region and size by population (log scale)
gapminder <- gapminder %>% mutate(l_gdpPercap=log(gdpPercap))
ggplot(data = gapminder, aes(x = l_gdpPercap, y = lifeExp)) +
  geom_point(aes(size = pop, col = region), alpha = 0.3) +
  labs(x = "GDP per capita (log)", y = "Life expectancy at birth") +
  theme_minimal() # Minimal theme
Code
# Plot 6 (interactive): Hans Rosling's bubble chart
# (https://www.gapminder.org/fw/world-health-chart/)
# Color scales: https://cran.r-project.org/web/packages/viridis/index.html
# gganimate package: https://gganimate.com/
# plotly package in R: https://plotly.com/r/
library(viridis)
library(gganimate)
library(plotly)
plot_ly(gapminder,
y = ~lifeExp,
x = ~l_gdpPercap,
frame = ~year,
type = 'scatter',
mode = 'markers',
size = ~pop,
color = ~region,
colors = 'Set1') %>%
  layout(xaxis = list(title = "GDP per capita (log)"),
         yaxis = list(title = "Life expectancy at birth"))
The following web pages give details on tidyverse-style data management and plotting in Python:
pandas library: https://pandas.pydata.org/docs/user_guide/index.html
https://wesmckinney.com/book/accessing-data
plotnine library: https://plotnine.readthedocs.io/en/stable/
More generally, to learn how to do “data science” with Python, see the book by McKinney: https://wesmckinney.com/book/.
Code
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import *
# Load data
gapminder = pd.read_csv('data/GAPMINDER.csv')
gapminder.shape
(13454, 8)
Code
gapminder.columns
Index(['iso2c', 'iso3c', 'country', 'date', 'NY.GDP.PCAP.CD', 'SP.DYN.LE00.IN',
'SP.POP.TOTL', 'region'],
dtype='object')
Code
gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13454 entries, 0 to 13453
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 iso2c 13392 non-null object
1 iso3c 13454 non-null object
2 country 13454 non-null object
3 date 13454 non-null int64
4 NY.GDP.PCAP.CD 10346 non-null float64
5 SP.DYN.LE00.IN 12890 non-null float64
6 SP.POP.TOTL 13424 non-null float64
7 region 13454 non-null object
dtypes: float64(3), int64(1), object(4)
memory usage: 841.0+ KB
Code
# Rename variables
gapminder.rename(columns={'date': 'year'}, inplace=True)
gapminder.rename(columns={'NY.GDP.PCAP.CD': 'gdpPercap'}, inplace=True)
gapminder.rename(columns={'SP.DYN.LE00.IN': 'lifeExp'}, inplace=True)
gapminder.rename(columns={'SP.POP.TOTL': 'pop'}, inplace=True)
# First and last rows
gapminder.head()
iso2c iso3c country ... lifeExp pop region
0 AW ABW Aruba ... 64.152 54608.0 Latin America & Caribbean
1 AW ABW Aruba ... 64.537 55811.0 Latin America & Caribbean
2 AW ABW Aruba ... 64.752 56682.0 Latin America & Caribbean
3 AW ABW Aruba ... 65.132 57475.0 Latin America & Caribbean
4 AW ABW Aruba ... 65.294 58178.0 Latin America & Caribbean
[5 rows x 8 columns]
Code
gapminder.tail()
iso2c iso3c country ... lifeExp pop region
13449 ZW ZWE Zimbabwe ... 60.709 14751101.0 Sub-Saharan Africa
13450 ZW ZWE Zimbabwe ... 61.414 15052184.0 Sub-Saharan Africa
13451 ZW ZWE Zimbabwe ... 61.292 15354608.0 Sub-Saharan Africa
13452 ZW ZWE Zimbabwe ... 61.124 15669666.0 Sub-Saharan Africa
13453 ZW ZWE Zimbabwe ... 59.253 15993524.0 Sub-Saharan Africa
[5 rows x 8 columns]
Code
# tidyverse operations with pandas
# select
gapminder_selected = gapminder[['year','country', 'pop', 'gdpPercap']]
# filter (.copy() so that the new column below is added to an independent
# frame and does not trigger SettingWithCopyWarning)
gapminder_filtered = gapminder_selected[(gapminder_selected["year"] >= 1980)].copy()
# mutate
gapminder_filtered['GDP'] = gapminder_filtered['gdpPercap'] * gapminder_filtered['pop']
# groupby
gapminder_grouped = gapminder_filtered.groupby('country')
# summarise
gapminder_summarised = gapminder_grouped['GDP'].mean()
gapminder_summarised = gapminder_summarised.dropna()
# arrange (sort)
gapminder_summarised.sort_values(ascending=False).head(10)
country
United States 1.113049e+13
China 4.117048e+12
Japan 4.101114e+12
Germany 2.481413e+12
United Kingdom 1.819280e+12
France 1.788376e+12
Italy 1.436080e+12
Brazil 1.031871e+12
India 9.960817e+11
Canada 9.955903e+11
Name: GDP, dtype: float64
Code
gapminder_summarised.sort_values(ascending=True).head(10)
country
Tuvalu 2.688518e+07
Kiribati 9.964231e+07
Nauru 1.021136e+08
Marshall Islands 1.287742e+08
Palau 2.140444e+08
Tonga 2.477645e+08
Micronesia, Fed. Sts. 2.511701e+08
Sao Tome and Principe 2.563795e+08
Dominica 3.359127e+08
Vanuatu 4.166275e+08
Name: GDP, dtype: float64
Code
# The same result is obtained with nlargest and nsmallest
gapminder_summarised.nlargest(10)
country
United States 1.113049e+13
China 4.117048e+12
Japan 4.101114e+12
Germany 2.481413e+12
United Kingdom 1.819280e+12
France 1.788376e+12
Italy 1.436080e+12
Brazil 1.031871e+12
India 9.960817e+11
Canada 9.955903e+11
Name: GDP, dtype: float64
Code
gapminder_summarised.nsmallest(10)
country
Tuvalu 2.688518e+07
Kiribati 9.964231e+07
Nauru 1.021136e+08
Marshall Islands 1.287742e+08
Palau 2.140444e+08
Tonga 2.477645e+08
Micronesia, Fed. Sts. 2.511701e+08
Sao Tome and Principe 2.563795e+08
Dominica 3.359127e+08
Vanuatu 4.166275e+08
Name: GDP, dtype: float64
Code
# Pipes in pandas: chained operations with '.'
gapminder[(gapminder["year"] >= 1980)].groupby('region')['country'].nunique()
region
East Asia & Pacific 37
Europe & Central Asia 58
Latin America & Caribbean 42
Middle East & North Africa 21
North America 3
South Asia 8
Sub-Saharan Africa 48
Name: country, dtype: int64
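The chained style above can be extended to reproduce, step by step, the kind of select / filter / mutate / group_by / summarise / arrange pipeline used for AVG_GDP in the R part. The following is only a sketch on a hypothetical toy table (`toy` and its values are invented); `query`, `assign`, named aggregation in `agg`, and `sort_values` are standard pandas methods.

```python
import pandas as pd

# Hypothetical miniature of the gapminder table (values are made up).
toy = pd.DataFrame({
    "year": [1979, 1980, 1981, 1980, 1981],
    "country": ["A", "A", "A", "B", "B"],
    "pop": [10.0, 10.0, 12.0, 5.0, 5.0],
    "gdpPercap": [1.0, 2.0, 3.0, 4.0, None],
})

avg_gdp = (
    toy[["year", "country", "pop", "gdpPercap"]]       # select
    .query("year >= 1980")                             # filter
    .assign(GDP=lambda d: d["gdpPercap"] * d["pop"])   # mutate
    .groupby("country", as_index=False)                # group_by
    .agg(GDP_avg=("GDP", "mean"))                      # summarise (mean skips NaN)
    .sort_values("GDP_avg", ascending=False)           # arrange
    .dropna()                                          # na.omit
    .reset_index(drop=True)
)
print(avg_gdp)
```

Wrapping the chain in parentheses lets each step sit on its own line, which reads much like a `%>%` pipeline.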
Code
gapminder[(gapminder["year"] >= 1980)].groupby('year')['lifeExp'].mean()
year
1980 62.219302
1981 62.562886
1982 62.788927
1983 63.105528
1984 63.428057
1985 63.710139
1986 64.154868
1987 64.389125
1988 64.506298
1989 64.971556
1990 65.176808
1991 65.301840
1992 65.289795
1993 65.498318
1994 65.806758
1995 66.005868
1996 66.227895
1997 66.427099
1998 66.494850
1999 66.871951
2000 67.358588
2001 67.663336
2002 67.928447
2003 68.227456
2004 68.591499
2005 68.897906
2006 69.249848
2007 69.554080
2008 69.885981
2009 70.273578
2010 70.580052
2011 70.972012
2012 71.277017
2013 71.533768
2014 71.803422
2015 72.002561
2016 72.301787
2017 72.522341
2018 72.720110
2019 72.930611
2020 72.309699
2021 71.725304
Name: lifeExp, dtype: float64
Code
gapminder[(gapminder["year"] >= 1980)].groupby(['year', 'region'])[['lifeExp', 'gdpPercap']].mean()
lifeExp gdpPercap
year region
1980 East Asia & Pacific 63.130124 4380.888303
Europe & Central Asia 70.008017 12551.671629
Latin America & Caribbean 65.929825 2036.051429
Middle East & North Africa 62.922230 9305.049529
North America 74.055317 11666.984839
... ... ...
2021 Latin America & Caribbean 72.968782 14167.117807
Middle East & North Africa 74.367904 17286.260464
North America 79.401959 78117.587729
South Asia 70.535375 3177.624523
Sub-Saharan Africa 61.750413 2365.335575
[294 rows x 2 columns]
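Grouping by two keys, as in the result above, returns a frame indexed by a (year, region) MultiIndex, whereas the dplyr version returns `year` and `region` as ordinary columns. A minimal sketch of how to get the flat layout back, on a hypothetical toy table with invented values:

```python
import pandas as pd

# Hypothetical toy data (values are made up).
toy = pd.DataFrame({
    "year": [1980, 1980, 1980, 1981],
    "region": ["R1", "R1", "R2", "R1"],
    "lifeExp": [60.0, 70.0, 50.0, 66.0],
})

# Grouped means: the grouping keys become a two-level row index.
nested = toy.groupby(["year", "region"])[["lifeExp"]].mean()

# reset_index() turns the index levels back into regular columns.
flat = nested.reset_index()
print(flat)
```

Passing `as_index=False` to `groupby` is an alternative that yields the flat layout directly.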
Code
# Joining, combining and reshaping data files
# (https://wesmckinney.com/book/data-wrangling)
flights = r.flights
planes = r.planes
flights_planes = flights.merge(planes, on = 'tailnum', how = 'left')
flights_planes_selected = flights_planes[['month', 'day', 'dep_time', 'arr_time', 'carrier', 'flight', 'tailnum', 'model']]
flights_planes_selected
month day dep_time arr_time carrier flight tailnum model
0 1 1 517 830 UA 1545 N14228 737-824
1 1 1 533 850 UA 1714 N24211 737-824
2 1 1 542 923 AA 1141 N619AA 757-223
3 1 1 544 1004 B6 725 N804JB A320-232
4 1 1 554 812 DL 461 N668DN 757-232
... ... ... ... ... ... ... ... ...
336771 9 30 -2147483648 -2147483648 9E 3393 None NaN
336772 9 30 -2147483648 -2147483648 9E 3525 None NaN
336773 9 30 -2147483648 -2147483648 MQ 3461 N535MQ NaN
336774 9 30 -2147483648 -2147483648 MQ 3572 N511MQ NaN
336775 9 30 -2147483648 -2147483648 MQ 3531 N839MQ NaN
[336776 rows x 8 columns]
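The value -2147483648 in the last rows above is the smallest 32-bit integer, which is how R stores a missing integer (NA_integer_) internally. It most likely appears here because the R integer columns were passed to Python through reticulate as plain integers. Assuming that interpretation, a minimal sketch of how such sentinels could be turned into proper missing values (the toy column below is hypothetical):

```python
import numpy as np
import pandas as pd

# Assumption: this sentinel is R's NA_integer_ as seen from Python.
R_NA_INTEGER = np.iinfo(np.int32).min  # -2147483648

# Hypothetical column mimicking dep_time after the R-to-Python transfer.
toy = pd.DataFrame(
    {"dep_time": np.array([517, R_NA_INTEGER], dtype="int32")}
)

# Keep values that differ from the sentinel, set the rest to missing,
# then store the result in pandas' nullable integer type.
clean = toy.assign(
    dep_time=toy["dep_time"]
    .where(toy["dep_time"] != R_NA_INTEGER)
    .astype("Int64")
)
print(clean)
```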
Code
# tidyverse-style plotting with `plotnine`
gapminder = gapminder[(gapminder["year"] >= 1980)][['year', 'region', 'country', 'pop', 'lifeExp', 'gdpPercap']]
gapminder = gapminder.dropna()
gapminder['l_gdpPercap'] = np.log(gapminder['gdpPercap'])
gapminder = gapminder.sort_values(by=['year'])
# Plot 1: basic scatter plot
(
ggplot(gapminder, aes(x='gdpPercap', y='lifeExp'))
+ geom_point(alpha=0.7)
+ labs(x='gdpPercap', y='lifeExp')
)
<Figure Size: (640 x 480)>
Code
# Plot 2: with a nonparametric fit
(
ggplot(gapminder, aes(x='gdpPercap', y='lifeExp'))
+ geom_point(alpha=0.2) + geom_smooth(method = "loess")
+ labs(x='gdpPercap', y='lifeExp')
)
<Figure Size: (640 x 480)>
Code
# Plot 3: colors by region
(
ggplot(gapminder, aes(x='gdpPercap', y='lifeExp', color='factor(region)'))
+ geom_point(alpha=0.3)
+ labs(x='gdpPercap', y='lifeExp')
)
<Figure Size: (640 x 480)>
Code
# Plot 4: colors by region and size by population
(
ggplot(gapminder, aes(x='gdpPercap', y='lifeExp', color='region', size='pop'))
+ geom_point(alpha=0.3)
+ labs(x='gdpPercap', y='lifeExp')
)
<Figure Size: (640 x 480)>
Code
# Plot 5: colors by region and size by population (log scale)
(
ggplot(gapminder, aes(x='l_gdpPercap', y='lifeExp', color='region', size='pop'))
+ geom_point(alpha=0.3)
+ labs(x='GDP per capita (log)', y='Life expectancy at birth')
)
<Figure Size: (640 x 480)>
Code
# Plot 6 (interactive): Hans Rosling's bubble chart
# plotly library in Python: https://plotly.com/python/
import plotly.express as px
px.scatter(gapminder,
           y = "lifeExp",
           x = "l_gdpPercap",
           hover_name = "country",
           hover_data = ['country'],
           color = "region",
           size = "pop", size_max = 45,
           animation_frame = 'year'
           )