Q&A 1 How do you obtain real-world economics datasets?

1.1 Explanation

The most reliable open sources for macroeconomic data (GDP, inflation, unemployment, trade, etc.) are:

  • World Bank – World Development Indicators (WDI): global coverage since ~1960, CSV/API.
  • IMF – World Economic Outlook (WEO): historical + forecasts, Excel/CSV.
  • UN Data: national accounts, trade, population; CSV/Excel.
  • OECD Data: rich indicators for OECD members; CSV/API.
  • FRED via Nasdaq Data Link: US & global macro series through an API.

In CDI we’ll start with the World Bank API because it’s free, broad, and simple; the wbdata package wraps it, but the example below calls the REST API directly with requests. You can later combine this with Nasdaq Data Link (FRED) for complementary series.


1.2 Installation

Install the packages (note: datetime is part of the Python standard library, so there is no need to install it):

pip install wbdata pandas matplotlib requests

# Optional (for later FRED use):
pip install nasdaq-data-link python-dotenv
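
If you do add Nasdaq Data Link later, keep the API key out of your code. A minimal sketch, assuming you store the key in a .env file under the name NASDAQ_DATA_LINK_API_KEY (both the file name and the variable name are conventions chosen here, not requirements):

```python
import os

def get_api_key(var="NASDAQ_DATA_LINK_API_KEY"):
    """Return the API key from a .env file (if python-dotenv is installed) or the environment."""
    try:
        from dotenv import load_dotenv
        load_dotenv()  # reads a .env file in the current directory, if present
    except ImportError:
        pass  # python-dotenv not installed: fall back to plain environment variables
    return os.getenv(var)  # None if the variable is unset

key = get_api_key()
if key is None:
    print("No API key found; set NASDAQ_DATA_LINK_API_KEY in .env or the environment.")
```

This way the script runs even before python-dotenv is installed, and the key never ends up committed to version control.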

1.3 Python Code

# Fast EAC GDP fetch: World Bank API + parallel requests
# Requires: pip install pandas requests

import os, time, requests, pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed

EAC = ["BDI", "COD", "KEN", "RWA", "SSD", "TZA", "UGA", "SOM", "USA"]  # EAC members plus USA for comparison
INDICATOR = "NY.GDP.MKTP.CD"  # GDP (current US$)
START, END = 2000, 2024

os.makedirs("data", exist_ok=True)

def fetch_country(country, indicator=INDICATOR, start=START, end=END,
                  max_retries=5, backoff=1.6, timeout=12):
    """Fetch one country's indicator with retry/backoff. Returns tidy DataFrame."""
    base = f"https://api.worldbank.org/v2/country/{country}/indicator/{indicator}"
    params = {"date": f"{start}:{end}", "format": "json", "per_page": 20000}
    for attempt in range(1, max_retries + 1):
        try:
            r = requests.get(base, params=params, timeout=timeout)
            r.raise_for_status()
            data = r.json()
            # data[1] can be None when the API returns only an error message
            rows = data[1] if isinstance(data, list) and len(data) > 1 and data[1] else []
            df = pd.DataFrame([
                # prefer the ISO-3 code to match the request; fall back to the ISO-2 id
                {"country": row.get("countryiso3code") or row["country"]["id"],
                 "year": int(row["date"]), "value": row["value"]}
                for row in rows if row["value"] is not None
            ])
            if not df.empty:
                df = df.sort_values("year").reset_index(drop=True)
            return df
        except Exception as e:
            if attempt == max_retries:
                print(f"[{country}] failed after {attempt} attempts: {e}")
                return pd.DataFrame(columns=["country", "year", "value"])
            sleep_s = backoff ** attempt
            time.sleep(sleep_s)

# Fetch in parallel (adjust workers if your connection is slower/faster)
dfs = []
with ThreadPoolExecutor(max_workers=6) as ex:
    futures = {ex.submit(fetch_country, c): c for c in EAC}
    for fut in as_completed(futures):
        c = futures[fut]
        dfc = fut.result()
        if dfc is not None and not dfc.empty:
            dfs.append(dfc)
        else:
            print(f"[{c}] no data returned.")

# Combine, clean, save
if dfs:
    df = pd.concat(dfs, ignore_index=True)
    df = df.rename(columns={"value": "gdp_usd_current"})
    out = "data/gdp_wdi_EAC_USA_2000_2024.csv"
    df.to_csv(out, index=False)
    print(f"Saved -> {out}")
    print(df.head(12))
else:
    print("No data fetched. Try reducing the date range or check your network.")
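
Once the tidy CSV is saved, a wide table (one row per year, one column per country) is often handier for plotting or cross-country comparison. A minimal sketch using pandas pivot; the column names match the script above, but the sample values here are made up for illustration:

```python
import pandas as pd

# Tiny made-up sample in the same tidy shape the script saves
tidy = pd.DataFrame({
    "country": ["KEN", "KEN", "TZA", "TZA"],
    "year": [2022, 2023, 2022, 2023],
    "gdp_usd_current": [1.13e11, 1.08e11, 7.6e10, 7.9e10],
})

# Reshape: years down the side, one column per country
wide = tidy.pivot(index="year", columns="country", values="gdp_usd_current")
print(wide)
```

In practice you would replace the sample DataFrame with pd.read_csv("data/gdp_wdi_EAC_USA_2000_2024.csv") and pass the wide frame straight to wide.plot() for a quick comparison chart.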