Application 1.1: Data management and visualization

Basic tidyverse grammar in R and Python

This application gives basic examples of data management and visualization in R and Python using the “tidyverse philosophy”, a way of working within each language whose goal is to structure the raw data for subsequent statistical analysis:

“At a high level, the tidyverse is a language for solving data science challenges […]. Its primary goal is to facilitate a conversation between a human and a computer about data. Less abstractly, the tidyverse is a collection of […] packages that share a high-level design philosophy and low-level grammar and data structures, so that learning one package makes it easier to learn the next.” (Wickham et al., 2019)

The following web pages give details on tidyverse-style management (cleaning and preparation) of datasets and on plotting in R:

  1. Tidy data: https://tidyr.tidyverse.org/articles/tidy-data.html

  2. The tidyverse meta-package in R:

    https://www.tidyverse.org/

    https://tidyverse.tidyverse.org/articles/paper.html

  3. Several lessons on how the tidyverse works in R:

    https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-1-getting-started/

    https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-2-data-visualisation/

    https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-3-data-wrangling-and-tidying/

    https://education.rstudio.com/blog/2020/07/teaching-the-tidyverse-in-2020-part-4-when-to-purrr/
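
The "tidy data" idea behind the first link can be illustrated with a very small sketch. It is written in Python with pandas (the second language used in this application), and the table `wide` and its values are invented for illustration only. A "wide" table with one column per year is reshaped so that each row holds a single country-year observation:

```python
import pandas as pd

# Hypothetical "wide" table: one column per year (values are made up).
wide = pd.DataFrame({
    "country": ["A", "B"],
    "1980": [1.0, 2.0],
    "1981": [1.5, 2.5],
})

# Tidy ("long") form: one row per country-year observation.
tidy = wide.melt(id_vars="country", var_name="year", value_name="value")
tidy["year"] = tidy["year"].astype(int)
tidy = tidy.sort_values(["country", "year"]).reset_index(drop=True)
print(tidy)
```

In the long form every variable (country, year, value) has its own column, which is the layout that the grouping and plotting operations used below expect.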

More generally, to learn how to do “data science” with R, see the book by Wickham, Çetinkaya-Rundel and Grolemund: https://r4ds.hadley.nz/ (a Spanish version of the book is available at https://es.r4ds.hadley.nz/).

About the data used in this application:

  1. Gapminder (https://www.gapminder.org/fw/world-health-chart/): the country-level data were extracted from the World Bank database (https://data.worldbank.org/) using the WDI package (https://vincentarelbundock.github.io/WDI/index.html).

  2. NYC_Flights_2013 (https://github.com/tidyverse/nycflights13): flights departing from New York to destinations in the United States, Puerto Rico and the Virgin Islands.

Code
# Load libraries
library(tidyverse)
tidyverse_packages()
 [1] "broom"         "conflicted"    "cli"           "dbplyr"       
 [5] "dplyr"         "dtplyr"        "forcats"       "ggplot2"      
 [9] "googledrive"   "googlesheets4" "haven"         "hms"          
[13] "httr"          "jsonlite"      "lubridate"     "magrittr"     
[17] "modelr"        "pillar"        "purrr"         "ragg"         
[21] "readr"         "readxl"        "reprex"        "rlang"        
[25] "rstudioapi"    "rvest"         "stringr"       "tibble"       
[29] "tidyr"         "xml2"          "tidyverse"    

Code
library(reticulate)
Code
# Read data
# (https://es.r4ds.hadley.nz/10-tibble.html)
# (https://es.r4ds.hadley.nz/11-import.html)
gapminder <- read_csv("data/GAPMINDER.csv")
dim(gapminder)
[1] 13454     8
Code
class(gapminder)
[1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame" 
Code
# Data handling and transformation
# (https://es.r4ds.hadley.nz/05-transform.html)
#
# Rename variables
gapminder <- gapminder %>% 
  rename(year = date, 
         gdpPercap = NY.GDP.PCAP.CD, 
         lifeExp = SP.DYN.LE00.IN, 
         pop = SP.POP.TOTL)
# First and last rows
head(gapminder)
# A tibble: 6 × 8
  iso2c iso3c country  year gdpPercap lifeExp   pop region                   
  <chr> <chr> <chr>   <dbl>     <dbl>   <dbl> <dbl> <chr>                    
1 AW    ABW   Aruba    1960        NA    64.2 54608 Latin America & Caribbean
2 AW    ABW   Aruba    1961        NA    64.5 55811 Latin America & Caribbean
3 AW    ABW   Aruba    1962        NA    64.8 56682 Latin America & Caribbean
4 AW    ABW   Aruba    1963        NA    65.1 57475 Latin America & Caribbean
5 AW    ABW   Aruba    1964        NA    65.3 58178 Latin America & Caribbean
6 AW    ABW   Aruba    1965        NA    65.5 58782 Latin America & Caribbean
Code
tail(gapminder)
# A tibble: 6 × 8
  iso2c iso3c country   year gdpPercap lifeExp      pop region            
  <chr> <chr> <chr>    <dbl>     <dbl>   <dbl>    <dbl> <chr>             
1 ZW    ZWE   Zimbabwe  2016     1422.    60.3 14452704 Sub-Saharan Africa
2 ZW    ZWE   Zimbabwe  2017     1192.    60.7 14751101 Sub-Saharan Africa
3 ZW    ZWE   Zimbabwe  2018     2269.    61.4 15052184 Sub-Saharan Africa
4 ZW    ZWE   Zimbabwe  2019     1422.    61.3 15354608 Sub-Saharan Africa
5 ZW    ZWE   Zimbabwe  2020     1373.    61.1 15669666 Sub-Saharan Africa
6 ZW    ZWE   Zimbabwe  2021     1774.    59.3 15993524 Sub-Saharan Africa
Code
# dplyr package (https://dplyr.tidyverse.org/) 
# and 'pipes' (https://es.r4ds.hadley.nz/18-pipes.html)
#
# dplyr "verbs" 
# select
gapminder_selected <- select(gapminder, year, country, pop, gdpPercap)
# filter
gapminder_filtered <- filter(gapminder_selected, year >= 1980)
# mutate
gapminder_mutated <- mutate(gapminder_filtered, GDP = gdpPercap*pop)
# group_by
gapminder_grouped <- group_by(gapminder_mutated, country)
# summarise
gapminder_summarised <- summarise(gapminder_grouped, 
                                  GDP_avg = mean(GDP, na.rm = TRUE))
# arrange
gapminder_arranged_ascending <- arrange(gapminder_summarised, GDP_avg)
gapminder_arranged_ascending
# A tibble: 217 × 2
   country                  GDP_avg
   <chr>                      <dbl>
 1 Tuvalu                 26885179.
 2 Kiribati               99642313.
 3 Nauru                 102113631.
 4 Marshall Islands      128774246.
 5 Palau                 214044364.
 6 Tonga                 247764480.
 7 Micronesia, Fed. Sts. 251170068.
 8 Sao Tome and Principe 256379527.
 9 Dominica              335912659.
10 Vanuatu               416627524.
# ℹ 207 more rows
Code
gapminder_arranged_descending <- arrange(gapminder_summarised, -GDP_avg)
gapminder_arranged_descending
# A tibble: 217 × 2
   country        GDP_avg
   <chr>            <dbl>
 1 United States  1.11e13
 2 China          4.12e12
 3 Japan          4.10e12
 4 Germany        2.48e12
 5 United Kingdom 1.82e12
 6 France         1.79e12
 7 Italy          1.44e12
 8 Brazil         1.03e12
 9 India          9.96e11
10 Canada         9.96e11
# ℹ 207 more rows
Code
# tidyverse pipe operator: chaining with '%>%'
# (native R pipe: |> )
AVG_GDP <- 
  gapminder %>% 
  select(year, country, pop, gdpPercap) %>% 
  filter(year>=1980) %>% 
  mutate(GDP=gdpPercap*pop) %>% 
  group_by(country) %>% 
  summarise(GDP_avg=mean(GDP, na.rm = TRUE)) %>% 
  arrange(-GDP_avg) %>% 
  na.omit()
head(AVG_GDP,10)
# A tibble: 10 × 2
   country        GDP_avg
   <chr>            <dbl>
 1 United States  1.11e13
 2 China          4.12e12
 3 Japan          4.10e12
 4 Germany        2.48e12
 5 United Kingdom 1.82e12
 6 France         1.79e12
 7 Italy          1.44e12
 8 Brazil         1.03e12
 9 India          9.96e11
10 Canada         9.96e11
Code
tail(AVG_GDP,10)
# A tibble: 10 × 2
   country                  GDP_avg
   <chr>                      <dbl>
 1 Vanuatu               416627524.
 2 Dominica              335912659.
 3 Sao Tome and Principe 256379527.
 4 Micronesia, Fed. Sts. 251170068.
 5 Tonga                 247764480.
 6 Palau                 214044364.
 7 Marshall Islands      128774246.
 8 Nauru                 102113631.
 9 Kiribati               99642313.
10 Tuvalu                 26885179.
Code
COUNT_cntr <- 
  gapminder %>%
  select(year, region, country) %>% 
  filter(year>=1980) %>% 
  group_by(region) %>%
  summarise(cntr_distinct = n_distinct(country))
COUNT_cntr
# A tibble: 7 × 2
  region                     cntr_distinct
  <chr>                              <int>
1 East Asia & Pacific                   37
2 Europe & Central Asia                 58
3 Latin America & Caribbean             42
4 Middle East & North Africa            21
5 North America                          3
6 South Asia                             8
7 Sub-Saharan Africa                    48
Code
AVG_lifeExp <- 
  gapminder %>%
  select(year, lifeExp) %>% 
  filter(year>=1980) %>% 
  group_by(year) %>%
  summarise(lifeExp_avg=mean(lifeExp, na.rm = TRUE)) 
AVG_lifeExp
# A tibble: 42 × 2
    year lifeExp_avg
   <dbl>       <dbl>
 1  1980        62.2
 2  1981        62.6
 3  1982        62.8
 4  1983        63.1
 5  1984        63.4
 6  1985        63.7
 7  1986        64.2
 8  1987        64.4
 9  1988        64.5
10  1989        65.0
# ℹ 32 more rows
Code
AVG_lifeExp_gdpPercap <- 
  gapminder %>% 
  select(year, region, lifeExp, gdpPercap) %>% 
  filter(year>=1980) %>% 
  group_by(year,region) %>%
  summarise(lifeExp_avg=mean(lifeExp, na.rm = TRUE),
            gdpPercap_avg=mean(gdpPercap, na.rm = TRUE)) 
AVG_lifeExp_gdpPercap
# A tibble: 294 × 4
# Groups:   year [42]
    year region                     lifeExp_avg gdpPercap_avg
   <dbl> <chr>                            <dbl>         <dbl>
 1  1980 East Asia & Pacific               63.1         4381.
 2  1980 Europe & Central Asia             70.0        12552.
 3  1980 Latin America & Caribbean         65.9         2036.
 4  1980 Middle East & North Africa        62.9         9305.
 5  1980 North America                     74.1        11667.
 6  1980 South Asia                        52.9          254.
 7  1980 Sub-Saharan Africa                50.3          816.
 8  1981 East Asia & Pacific               63.5         4014.
 9  1981 Europe & Central Asia             70.2        11283.
10  1981 Latin America & Caribbean         66.3         2206.
# ℹ 284 more rows
Code
# Joining, combining and reshaping datasets 
# (<https://es.r4ds.hadley.nz/13-relational-data.html>)
# The join family of operations:
# inner_join(df1, df2), left_join(df1, df2), right_join(df1, df2)
# full_join(df1, df2), semi_join(df1, df2), anti_join(df1, df2)
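
The join family listed above has close counterparts in pandas. The following is a minimal sketch in Python, not the approach used later in this application: the tables `flights_toy` and `planes_toy` are invented, and the semi and anti joins are emulated with `isin`, which is only one of several possible ways to express them.

```python
import pandas as pd

# Hypothetical toy tables sharing the key column 'tailnum'.
flights_toy = pd.DataFrame({"flight": [1, 2, 3],
                            "tailnum": ["N1", "N2", "N9"]})
planes_toy = pd.DataFrame({"tailnum": ["N1", "N2", "N3"],
                           "model": ["737", "A320", "757"]})

# Mutating joins: columns of both tables are combined.
inner = flights_toy.merge(planes_toy, on="tailnum", how="inner")  # inner_join
left = flights_toy.merge(planes_toy, on="tailnum", how="left")    # left_join
right = flights_toy.merge(planes_toy, on="tailnum", how="right")  # right_join
full = flights_toy.merge(planes_toy, on="tailnum", how="outer")   # full_join

# Filtering joins: only rows of the first table are kept or dropped.
has_match = flights_toy["tailnum"].isin(planes_toy["tailnum"])
semi = flights_toy[has_match]    # semi_join: flights with a matching plane
anti = flights_toy[~has_match]   # anti_join: flights without a matching plane

print(left)
print(anti)
```

The left join keeps all three flights and leaves `model` missing for the unmatched tail number, while the anti join returns exactly that unmatched row.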

Code
# Read data
library(nycflights13)
flights 
# A tibble: 336,776 × 19
    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
 1  2013     1     1      517            515         2      830            819
 2  2013     1     1      533            529         4      850            830
 3  2013     1     1      542            540         2      923            850
 4  2013     1     1      544            545        -1     1004           1022
 5  2013     1     1      554            600        -6      812            837
 6  2013     1     1      554            558        -4      740            728
 7  2013     1     1      555            600        -5      913            854
 8  2013     1     1      557            600        -3      709            723
 9  2013     1     1      557            600        -3      838            846
10  2013     1     1      558            600        -2      753            745
# ℹ 336,766 more rows
# ℹ 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>
Code
planes
# A tibble: 3,322 × 9
   tailnum  year type              manufacturer model engines seats speed engine
   <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
 1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
 6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
 9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
# ℹ 3,312 more rows
Code
# Example of a left join: left_join
# The 'by =' argument should be given explicitly to avoid errors or
# wrong automatic key matching. 
# Examples of other operations can be found at:
# https://cran.r-project.org/web/packages/dplyr/vignettes/two-table.html
flights_planes <- left_join(flights, planes, by = "tailnum") %>%
  select(month, day, dep_time, arr_time, 
         carrier, flight, tailnum, model)
flights_planes
# A tibble: 336,776 × 8
   month   day dep_time arr_time carrier flight tailnum model      
   <int> <int>    <int>    <int> <chr>    <int> <chr>   <chr>      
 1     1     1      517      830 UA        1545 N14228  737-824    
 2     1     1      533      850 UA        1714 N24211  737-824    
 3     1     1      542      923 AA        1141 N619AA  757-223    
 4     1     1      544     1004 B6         725 N804JB  A320-232   
 5     1     1      554      812 DL         461 N668DN  757-232    
 6     1     1      554      740 UA        1696 N39463  737-924ER  
 7     1     1      555      913 B6         507 N516JB  A320-232   
 8     1     1      557      709 EV        5708 N829AS  CL-600-2B19
 9     1     1      557      838 B6          79 N593JB  A320-232   
10     1     1      558      753 AA         301 N3ALAA  <NA>       
# ℹ 336,766 more rows
Code
# Grammar of graphics (ggplot2) [https://ggplot2.tidyverse.org/]
# (https://es.r4ds.hadley.nz/03-visualize.html)
gapminder <- gapminder %>% 
   filter(year>=1980) %>% 
   select(year, region, country, pop, lifeExp, gdpPercap)
gapminder <- gapminder %>% na.omit()
gapminder <- arrange(gapminder, year)
# Plot 1: basic scatter plot
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(alpha = 0.7)

Code
# Plot 2: with a nonparametric fit
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(alpha = 0.2) + geom_smooth(method = "loess")

Code
# Plot 3: color by region
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(aes(col = region), alpha = 0.3)

Code
# Plot 4: color by region and size by population
ggplot(data = gapminder, aes(x = gdpPercap, y = lifeExp)) + 
  geom_point(aes(size = pop, col = region), alpha = 0.3)

Code
# Plot 5: color by region and size by population (log scale)
gapminder <-  gapminder %>% mutate(l_gdpPercap=log(gdpPercap))
ggplot(data = gapminder, aes(x = l_gdpPercap, y = lifeExp)) +
    geom_point(aes(size = pop, col = region), alpha = 0.3) +
    labs(x = "GDP per capita (log)", y = "Life expectancy at birth") + 
    theme_minimal() # Minimal theme

Code
# Plot 6 (interactive): Hans Rosling's bubble chart 
# (https://www.gapminder.org/fw/world-health-chart/)
# Color scales: https://cran.r-project.org/web/packages/viridis/index.html
# gganimate package: https://gganimate.com/
# plotly package in R: https://plotly.com/r/
library(viridis)
library(gganimate)
library(plotly)
plot_ly(gapminder, 
        y = ~lifeExp, 
        x = ~l_gdpPercap,
        frame = ~year,
        type = 'scatter', 
        mode = 'markers',
        size = ~pop,
        color = ~region, 
        colors = 'Set1') %>%
    layout(xaxis = list(title = "GDP per capita (log)"),
           yaxis = list(title = "Life expectancy at birth"))

The following web pages give details on tidyverse-style data management and plotting in Python:

  1. The pandas library:

    https://pandas.pydata.org/docs/user_guide/index.html

    https://wesmckinney.com/book/accessing-data

    https://wesmckinney.com/book/data-cleaning

    https://wesmckinney.com/book/data-wrangling

  2. The plotnine library: https://plotnine.readthedocs.io/en/stable/

More generally, to learn how to do “data science” with Python, see McKinney's book: https://wesmckinney.com/book/.

Code
# Load libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from plotnine import *
# Read data
gapminder = pd.read_csv('data/GAPMINDER.csv')
gapminder.shape
(13454, 8)
Code
gapminder.columns
Index(['iso2c', 'iso3c', 'country', 'date', 'NY.GDP.PCAP.CD', 'SP.DYN.LE00.IN',
       'SP.POP.TOTL', 'region'],
      dtype='object')
Code
gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13454 entries, 0 to 13453
Data columns (total 8 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   iso2c           13392 non-null  object 
 1   iso3c           13454 non-null  object 
 2   country         13454 non-null  object 
 3   date            13454 non-null  int64  
 4   NY.GDP.PCAP.CD  10346 non-null  float64
 5   SP.DYN.LE00.IN  12890 non-null  float64
 6   SP.POP.TOTL     13424 non-null  float64
 7   region          13454 non-null  object 
dtypes: float64(3), int64(1), object(4)
memory usage: 841.0+ KB
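
The "Non-Null Count" column above shows that several columns contain missing values (for example, 10346 non-null entries out of 13454 for NY.GDP.PCAP.CD). As a small sketch of how pandas treats them, using an invented table `toy`: missing values can be counted per column, and reductions such as `mean` skip them by default, which plays the role of `na.rm = TRUE` in the R code.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing GDP value.
toy = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "gdpPercap": [np.nan, 100.0, 200.0, 300.0],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# Default behaviour: NaN is skipped, like mean(x, na.rm = TRUE) in R.
mean_skipna = toy["gdpPercap"].mean()

# With skipna=False the missing value propagates, like mean(x) in R.
mean_strict = toy["gdpPercap"].mean(skipna=False)

print(missing_per_column)
print(mean_skipna, mean_strict)
```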
Code
# Rename variables (a single call with one dictionary)
gapminder = gapminder.rename(columns={'date': 'year',
                                      'NY.GDP.PCAP.CD': 'gdpPercap',
                                      'SP.DYN.LE00.IN': 'lifeExp',
                                      'SP.POP.TOTL': 'pop'})
# First and last rows
gapminder.head()
  iso2c iso3c country  ...  lifeExp      pop                     region
0    AW   ABW   Aruba  ...   64.152  54608.0  Latin America & Caribbean
1    AW   ABW   Aruba  ...   64.537  55811.0  Latin America & Caribbean
2    AW   ABW   Aruba  ...   64.752  56682.0  Latin America & Caribbean
3    AW   ABW   Aruba  ...   65.132  57475.0  Latin America & Caribbean
4    AW   ABW   Aruba  ...   65.294  58178.0  Latin America & Caribbean

[5 rows x 8 columns]
Code
gapminder.tail()
      iso2c iso3c   country  ...  lifeExp         pop              region
13449    ZW   ZWE  Zimbabwe  ...   60.709  14751101.0  Sub-Saharan Africa
13450    ZW   ZWE  Zimbabwe  ...   61.414  15052184.0  Sub-Saharan Africa
13451    ZW   ZWE  Zimbabwe  ...   61.292  15354608.0  Sub-Saharan Africa
13452    ZW   ZWE  Zimbabwe  ...   61.124  15669666.0  Sub-Saharan Africa
13453    ZW   ZWE  Zimbabwe  ...   59.253  15993524.0  Sub-Saharan Africa

[5 rows x 8 columns]
Code
# tidyverse operations with pandas
# select
gapminder_selected = gapminder[['year','country', 'pop', 'gdpPercap']]
# filter (.copy() makes an independent DataFrame, so adding a column
# below does not trigger pandas' SettingWithCopyWarning)
gapminder_filtered = gapminder_selected[gapminder_selected["year"] >= 1980].copy()
# mutate
gapminder_filtered['GDP'] = gapminder_filtered['gdpPercap'] * gapminder_filtered['pop']
# groupby
gapminder_grouped = gapminder_filtered.groupby('country')
# summarise
gapminder_summarised = gapminder_grouped['GDP'].mean()
gapminder_summarised = gapminder_summarised.dropna()
# arrange (sort)
gapminder_summarised.sort_values(ascending=False).head(10)
country
United States     1.113049e+13
China             4.117048e+12
Japan             4.101114e+12
Germany           2.481413e+12
United Kingdom    1.819280e+12
France            1.788376e+12
Italy             1.436080e+12
Brazil            1.031871e+12
India             9.960817e+11
Canada            9.955903e+11
Name: GDP, dtype: float64
Code
gapminder_summarised.sort_values(ascending=True).head(10)
country
Tuvalu                   2.688518e+07
Kiribati                 9.964231e+07
Nauru                    1.021136e+08
Marshall Islands         1.287742e+08
Palau                    2.140444e+08
Tonga                    2.477645e+08
Micronesia, Fed. Sts.    2.511701e+08
Sao Tome and Principe    2.563795e+08
Dominica                 3.359127e+08
Vanuatu                  4.166275e+08
Name: GDP, dtype: float64
Code
# nlargest and nsmallest give the same result
gapminder_summarised.nlargest(10)
country
United States     1.113049e+13
China             4.117048e+12
Japan             4.101114e+12
Germany           2.481413e+12
United Kingdom    1.819280e+12
France            1.788376e+12
Italy             1.436080e+12
Brazil            1.031871e+12
India             9.960817e+11
Canada            9.955903e+11
Name: GDP, dtype: float64
Code
gapminder_summarised.nsmallest(10)
country
Tuvalu                   2.688518e+07
Kiribati                 9.964231e+07
Nauru                    1.021136e+08
Marshall Islands         1.287742e+08
Palau                    2.140444e+08
Tonga                    2.477645e+08
Micronesia, Fed. Sts.    2.511701e+08
Sao Tome and Principe    2.563795e+08
Dominica                 3.359127e+08
Vanuatu                  4.166275e+08
Name: GDP, dtype: float64
Code
# Pipes in pandas: chained operations with '.'
gapminder[(gapminder["year"] >= 1980)].groupby('region')['country'].nunique()
region
East Asia & Pacific           37
Europe & Central Asia         58
Latin America & Caribbean     42
Middle East & North Africa    21
North America                  3
South Asia                     8
Sub-Saharan Africa            48
Name: country, dtype: int64
Code
gapminder[(gapminder["year"] >= 1980)].groupby('year')['lifeExp'].mean()
year
1980    62.219302
1981    62.562886
1982    62.788927
1983    63.105528
1984    63.428057
1985    63.710139
1986    64.154868
1987    64.389125
1988    64.506298
1989    64.971556
1990    65.176808
1991    65.301840
1992    65.289795
1993    65.498318
1994    65.806758
1995    66.005868
1996    66.227895
1997    66.427099
1998    66.494850
1999    66.871951
2000    67.358588
2001    67.663336
2002    67.928447
2003    68.227456
2004    68.591499
2005    68.897906
2006    69.249848
2007    69.554080
2008    69.885981
2009    70.273578
2010    70.580052
2011    70.972012
2012    71.277017
2013    71.533768
2014    71.803422
2015    72.002561
2016    72.301787
2017    72.522341
2018    72.720110
2019    72.930611
2020    72.309699
2021    71.725304
Name: lifeExp, dtype: float64
Code
gapminder[(gapminder["year"] >= 1980)].groupby(['year', 'region'])[['lifeExp', 'gdpPercap']].mean()
                                   lifeExp     gdpPercap
year region                                             
1980 East Asia & Pacific         63.130124   4380.888303
     Europe & Central Asia       70.008017  12551.671629
     Latin America & Caribbean   65.929825   2036.051429
     Middle East & North Africa  62.922230   9305.049529
     North America               74.055317  11666.984839
...                                    ...           ...
2021 Latin America & Caribbean   72.968782  14167.117807
     Middle East & North Africa  74.367904  17286.260464
     North America               79.401959  78117.587729
     South Asia                  70.535375   3177.624523
     Sub-Saharan Africa          61.750413   2365.335575

[294 rows x 2 columns]
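
The full R pipeline that produced `AVG_GDP` (select, filter, mutate, group_by, summarise, arrange) can also be written as a single chained expression in pandas. The following is a minimal sketch on an invented table `toy`, not the code used above; `query`, `assign` and named aggregation in `agg` are one possible choice of pandas methods for each dplyr verb.

```python
import pandas as pd

# Hypothetical data with the same column names as the gapminder table.
toy = pd.DataFrame({
    "year": [1975, 1980, 1981, 1980, 1981],
    "country": ["A", "A", "A", "B", "B"],
    "pop": [10.0, 10.0, 12.0, 5.0, 5.0],
    "gdpPercap": [1.0, 2.0, 3.0, 10.0, 12.0],
})

avg_gdp = (
    toy[["year", "country", "pop", "gdpPercap"]]        # select
    .query("year >= 1980")                               # filter
    .assign(GDP=lambda d: d["gdpPercap"] * d["pop"])     # mutate
    .groupby("country", as_index=False)                  # group_by
    .agg(GDP_avg=("GDP", "mean"))                        # summarise
    .sort_values("GDP_avg", ascending=False)             # arrange
    .dropna()                                            # na.omit
)
print(avg_gdp)
```

Wrapping the chain in parentheses allows one method per line, which keeps the correspondence with the `%>%` version easy to read.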
Code
# Joining, combining and reshaping datasets
# (https://wesmckinney.com/book/data-wrangling)
# 'r' is the reticulate object that exposes the R session's variables to Python
flights = r.flights 
planes = r.planes
flights_planes = flights.merge(planes, on = 'tailnum', how = 'left')
flights_planes_selected = flights_planes[['month', 'day', 'dep_time', 'arr_time', 'carrier', 'flight', 'tailnum', 'model']]
flights_planes_selected
        month  day    dep_time    arr_time carrier  flight tailnum     model
0           1    1         517         830      UA    1545  N14228   737-824
1           1    1         533         850      UA    1714  N24211   737-824
2           1    1         542         923      AA    1141  N619AA   757-223
3           1    1         544        1004      B6     725  N804JB  A320-232
4           1    1         554         812      DL     461  N668DN   757-232
...       ...  ...         ...         ...     ...     ...     ...       ...
336771      9   30 -2147483648 -2147483648      9E    3393    None       NaN
336772      9   30 -2147483648 -2147483648      9E    3525    None       NaN
336773      9   30 -2147483648 -2147483648      MQ    3461  N535MQ       NaN
336774      9   30 -2147483648 -2147483648      MQ    3572  N511MQ       NaN
336775      9   30 -2147483648 -2147483648      MQ    3531  N839MQ       NaN

[336776 rows x 8 columns]
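
The value -2147483648 in the last rows of `dep_time` and `arr_time` is the smallest 32-bit integer, which R uses internally to encode a missing integer (`NA_integer_`). It appears here most likely because the missing values crossed from R to Python as plain integers. A minimal sketch of one way to turn that sentinel back into a proper missing value, on an invented column (the constant name `R_NA_INTEGER` is hypothetical):

```python
import pandas as pd

# Smallest 32-bit integer; R's internal encoding of NA_integer_.
R_NA_INTEGER = -2147483648

# Hypothetical column in which one departure time is missing.
toy = pd.DataFrame({"dep_time": [517, R_NA_INTEGER, 542]})

# Replace the sentinel with a missing value and use pandas' nullable
# integer dtype so the remaining values stay integers.
toy["dep_time"] = (
    toy["dep_time"]
    .mask(toy["dep_time"] == R_NA_INTEGER)
    .astype("Int64")
)
print(toy)
```

After this step, `dropna()` and the default NaN-skipping reductions treat those rows as missing instead of as very large negative times.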
Code
# tidyverse-style plotting with `plotnine`
gapminder = gapminder[gapminder["year"] >= 1980][['year', 'region', 'country', 'pop', 'lifeExp', 'gdpPercap']]
gapminder = gapminder.dropna()
gapminder['l_gdpPercap'] = np.log(gapminder['gdpPercap'])
gapminder = gapminder.sort_values(by=['year'])
# Plot 1: basic scatter plot
(
    ggplot(gapminder, aes(x='gdpPercap', y='lifeExp'))
    + geom_point(alpha=0.7)
    + labs(x='gdpPercap', y='lifeExp')
)
<Figure Size: (640 x 480)>

Code
# Plot 2: with a nonparametric fit
(
    ggplot(gapminder, aes(x='gdpPercap', y='lifeExp'))
    + geom_point(alpha=0.2) + geom_smooth(method = "loess")
    + labs(x='gdpPercap', y='lifeExp')
)
<Figure Size: (640 x 480)>

Code
# Plot 3: color by region
(
    ggplot(gapminder, aes(x='gdpPercap', y='lifeExp', color='factor(region)'))
    + geom_point(alpha=0.3)
    + labs(x='gdpPercap', y='lifeExp')
)
<Figure Size: (640 x 480)>

Code
# Plot 4: color by region and size by population
(
    ggplot(gapminder, aes(x='gdpPercap', y='lifeExp', color='region', size='pop'))
    + geom_point(alpha=0.3)
    + labs(x='gdpPercap', y='lifeExp')
)
<Figure Size: (640 x 480)>

Code
# Plot 5: color by region and size by population (log scale)
(
    ggplot(gapminder, aes(x='l_gdpPercap', y='lifeExp', color='region', size='pop'))
    + geom_point(alpha=0.3)
    + labs(x='GDP per capita (log)', y='Life expectancy at birth')
)
<Figure Size: (640 x 480)>

Code
# Plot 6 (interactive): Hans Rosling's bubble chart
# plotly library in Python: https://plotly.com/python/
import plotly.express as px
px.scatter(gapminder,
            y = "lifeExp", 
            x = "l_gdpPercap",
            hover_name = "country",
            hover_data= ['country'], 
            color = "region", 
            size = "pop", size_max = 45,
            animation_frame= 'year'
)