Changes in version 1.0.0 (2026-05-27)

Breaking changes

- get_ideb() has a new signature: get_ideb(level, stage, metric, year, quiet). The old positional usage get_ideb(year, level, stage) still works with a deprecation warning, but the year parameter now filters IDEB editions instead of selecting which file to download.
- get_ideb() now returns data in tidy long format instead of wide format. Output columns depend on the metric parameter ("indicador", "aprovacao", "nota", "meta").
- get_ideb() now supports 5 geographic levels: "escola", "municipio", "estado", "regiao", and "brasil" (previously only escola and municipio).
- get_ideb() always downloads the most recent IDEB file available, which contains the full historical series. The year parameter filters editions.
- get_ideb_series() is deprecated. Use get_ideb(level, stage, metric) instead.
- list_ideb_available() now returns level, stage, and metric columns (previously returned year, level, stage).
- The uf parameter has been removed from get_ideb(). Filter the result with dplyr::filter() instead.

New features

- download_inep_file() timeout is now configurable via options(educabR.download_timeout = N) (seconds; default 600). Raise it when downloading large microdata (e.g. ENEM participantes at ~1.6 GB) over a slow link (issue #7).
- read_inep_file() now warns before reading files larger than 500 MB entirely into memory, suggesting n_max or UF filters to reduce memory pressure. Suppressible with quiet = TRUE; all get_*() callers propagate their quiet argument (issue #5).

Bug fixes

- get_ideb() no longer consumes several GB of RAM for school-level reads (issue #1). The xlsx is now read with column projection: only the vl_* columns matching the requested metric (and year, when given) are parsed; the others are skipped at the readxl C++ layer. INEP's NA tokens ("", "-", "ND") are also passed to read_excel(na = ...) so the missing-value strings never get allocated as R character vectors.
  For level = "escola", stage = "anos_iniciais", metric = "indicador", this cuts the in-memory result from ~133 MB to ~37 MB (4 years) or ~19 MB (1 year), with proportional drops in peak memory during reshape.
- Character columns from Excel readers (read_ideb_excel(), read_excel_safe()) and the FUNDEB enrollment OData fetcher (fetch_fundeb_enrollment()) are now normalized to UTF-8 NFC, matching the behavior already in read_inep_file(). Previously, equality comparisons against literals such as filter(rede == "Pública") could silently return zero rows on Windows because the source-file encoding produced non-canonical strings. The shared helper normalize_utf8_nfc() is now applied at every read entrypoint so all four code paths agree. Affects get_ideb(), get_cpc(), get_igc(), get_fundeb_enrollment().
- read_excel_safe() (used by get_cpc() and get_igc()) now passes INEP's missing-value tokens ("", "-", "ND", en/em dashes) to readxl::read_excel(na = ...) so those cells are loaded as NA instead of character strings cleaned up post-hoc (issue #4). Previously, a column whose first rows were all "-" could be inferred as logical and later numeric values silently dropped. clean_dash_values() remains as a safety net but is now largely redundant for CPC/IGC.

Internal

- download_inep_file() now verifies downloaded files before caching them (issue #3). Three checks run after the bytes hit disk: file size against the server's Content-Length (1% tolerance, catches truncated downloads), HTML-masquerade detection on the first 64 bytes (catches INEP maintenance pages served with HTTP 200), and ZIP magic-bytes (PK\x03\x04) for .zip destinations (catches proxy corruption). On any failure the corrupt file is deleted and the user gets a clear error telling them to retry, instead of a cryptic readxl / readr failure on the next call.
- validate_year() now rejects vectors and non-numeric input with a clear error pointing at purrr::map_dfr() for multi-year composition (issue #2).
  Previously, passing c(2017, 2019) to any of the 13 affected getters (get_cpc, get_idd, get_igc, get_capes, get_saeb, get_enem, get_enem_itens, get_enade, get_encceja, get_fundeb_distribution, get_fundeb_enrollment, get_censo_escolar, get_censo_superior) hit either a cryptic length > 1 error (R ≥ 4.2) or silently used only the first element (R < 4.2). get_ideb() is unaffected; it intentionally accepts year vectors.
- extract_zip() cleaned up: removed dead if (TRUE) branch and an unreachable cli_abort(); the muffle on extraction warnings was tightened from the broad erro|error pattern to the two specific messages that motivated it (issue #6).
- RoxygenNote bumped to 8.0.0 and man/*.Rd regenerated; systemfonts and textshaping declared in Suggests: to silence the cosmetic R CMD check NOTE about packages pulled transitively by pkgdown.
- R CMD check warnings cleared: em-dashes in cli_abort() message strings in R/utils-download.R are now written with Unicode escapes (R requires ASCII-only in code strings; comments are exempt).

Changes in version 0.9.0 (2026-04-03)

Breaking changes

- available_years() now dynamically discovers available years by querying data sources (HEAD requests for INEP, OData queries for FNDE). Results are cached per session. Falls back to a hardcoded list when offline.
- available_years() now accepts "fundeb_enrollment" as a separate dataset name. Previously, "fundeb" was shared between distribution and enrollment.
- Code columns (CO_*, CD_*) are now read as character instead of numeric across all datasets. This prevents loss of leading zeros in codes like municipality codes, course codes, and institution codes.

Bug fixes

- Fixed get_enade() failing for 9 of 19 available years. INEP uses inconsistent URLs for ENADE: _LGPD suffix for 2012-2019, .rar format for 2022. Added hardcoded URL map (enade_urls) with all 19 correct URLs.
- Fixed get_fundeb_enrollment() accepting years with no data in the FNDE API.
  The API currently only has data for 2017-2018.
- Fixed encoding warning ("unable to translate MARCO") on Windows caused by Unicode cedilla in FUNDEB month name map.
- Fixed clear_cache() failing to delete files on Windows when they were memory-mapped by readr. Now deletes entire directories and warns about locked files.
- Fixed ZIP extraction fallback not triggering for ENADE 2017 ("Illegal byte sequence" error was not matched by the encoding detection pattern).

New features

- Added .rar archive extraction support via 7-Zip. find_7z() searches common Windows install paths when 7z is not in PATH.
- Added strip_diacriticals() internal helper for encoding-safe text matching.
- read_inep_file() now auto-detects code columns (CO_*, CD_*) from the file header and reads them as character. No user action required.
- read_ideb_excel() and read_excel_safe() (CPC/IGC) now convert code columns to character after reading.

Changes in version 0.8.0

New features

FUNDEB (Fundo de Manutencao e Desenvolvimento da Educacao Basica)

- get_fundeb_distribution(): Download FUNDEB resource distribution data (years 2007-2026). Reads all sheets from STN Excel files and returns tidy long-format data with monthly transfer amounts by state, funding source, destination (states/municipalities), and table type (fundeb/adjustment).
- get_fundeb_enrollment(): Download FUNDEB enrollment data. Fetches from FNDE OData API with automatic pagination. Results cached as CSV.
- Filtering parameters for distribution: uf, source (FPE, FPM, ICMS, etc.), and destination ("uf" or "municipio").
- Data sources: Tesouro Transparente (https://www.tesourotransparente.gov.br) and FNDE (https://www.fnde.gov.br).

Changes in version 0.7.0

New features

CAPES (Dados Abertos da Pos-Graduacao)

- get_capes(): Download CAPES graduate education data (years 2013-2024).
- Supports 5 data types: programs ("programas"), students ("discentes"), faculty ("docentes"), courses ("cursos"), and theses/dissertations catalog ("catalogo").
- Uses CKAN API to dynamically discover download URLs (CAPES URLs contain UUIDs).
- Data source: CAPES Open Data Portal (https://dadosabertos.capes.gov.br).

Changes in version 0.6.0

New features

CPC (Conceito Preliminar de Curso)

- get_cpc(): Download CPC data (years 2007-2019, 2021-2023; no 2020 edition).
- Quality indicator for undergraduate courses, part of SINAES.
- Files are in Excel format (xls/xlsx), which requires the readxl package.
- Hardcoded URL map due to completely inconsistent INEP naming patterns.

IGC (Indice Geral de Cursos)

- get_igc(): Download IGC data (years 2007-2019, 2021-2023; no 2020 edition).
- Institutional quality indicator based on weighted CPC averages and CAPES scores.
- Files are in Excel format (xls/xlsx), except 2007 which is a 7z archive.
- Hardcoded URL map due to completely inconsistent INEP naming patterns.

Shared utilities

- read_excel_safe(): Internal helper to read Excel files with error handling.

Changes in version 0.5.0

New features

ENEM por Escola (ENEM by School)

- get_enem_escola(): Download ENEM results aggregated by school (2005-2015).
- Single bundled file covering all years. Discontinued after 2015.

IDD (Indicador de Diferença entre os Desempenhos Observado e Esperado)

- get_idd(): Download IDD microdata (years 2014-2019, 2021-2023; no 2020 edition).
- Measures the value added by undergraduate courses to student performance.
- Handles both ZIP (2021+) and 7z (2014-2019) archive formats via new extract_archive() utility.
- Automatic delimiter detection and dash-to-NA cleaning.

Changes in version 0.4.0

New features

ENCCEJA (Exame Nacional para Certificação de Competências de Jovens e Adultos)

- get_encceja(): Download ENCCEJA microdata (years 2014-2024).
- Automatic delimiter detection and dash-to-NA cleaning.

Changes in version 0.3.0

New features

ENADE (Exame Nacional de Desempenho dos Estudantes)

- get_enade(): Download ENADE microdata.
- Automatic delimiter detection and dash-to-NA cleaning.
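The "automatic delimiter detection and dash-to-NA cleaning" listed for the microdata readers could look roughly like the sketch below. This is an illustration only: detect_delim() and dash_to_na() are hypothetical names, not the package's internal helpers, the candidate delimiters are an assumption, and the sample rows are invented. The tokens "-" and "ND" are the ones this changelog mentions for INEP files.

```r
# Hypothetical helper: pick the candidate delimiter that occurs most often
# in the header line (not the package's actual implementation).
detect_delim <- function(header_line, candidates = c(";", ",", "|", "\t")) {
  counts <- vapply(
    candidates,
    function(d) lengths(regmatches(header_line, gregexpr(d, header_line, fixed = TRUE))),
    integer(1)
  )
  candidates[[which.max(counts)]]
}

# Hypothetical helper: turn INEP-style missing-value tokens into NA.
dash_to_na <- function(x, tokens = c("", "-", "ND")) {
  x <- trimws(x)
  x[x %in% tokens] <- NA_character_
  x
}

# Invented sample rows standing in for a small microdata file.
lines <- c("CO_CURSO;NT_GER;CO_UF", "00123;45,2;35", "00456;-;33")

delim <- detect_delim(lines[[1]])
df <- utils::read.table(
  text = lines, sep = delim, header = TRUE,
  colClasses = "character", stringsAsFactors = FALSE
)
df[] <- lapply(df, dash_to_na)

# Comma decimals are converted only after the dash tokens are already NA.
df$NT_GER <- as.numeric(sub(",", ".", df$NT_GER, fixed = TRUE))
```

Reading every column as character first keeps leading zeros in code columns such as CO_CURSO and avoids type guessing on columns that start with dashes; numeric conversion then happens explicitly per column.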
Censo da Educação Superior (Higher Education Census)

- get_censo_superior(): Download Higher Education Census microdata (years 2009-2024).
- Supports multiple data types: institutions ("ies"), courses ("cursos"), students ("alunos"), and faculty ("docentes").
- list_censo_superior_files(): List available files in a downloaded census.
- UF filtering via the uf parameter.

Changes in version 0.2.0

New features

SAEB (Sistema de Avaliação da Educação Básica)

- get_saeb(): Download SAEB microdata (years 2011, 2013, 2015, 2017, 2019, 2021, 2023).
- Supports multiple data types: student results ("aluno"), school ("escola"), principal ("diretor"), and teacher ("professor") questionnaires.
- Handles SAEB 2021 special case where INEP split downloads into elementary/high school and early childhood education files via the level parameter.

Bug fixes

- Fixed encoding detection on Windows using iconv() instead of validEnc().
- Fixed "Latin-1" encoding name to "latin1" for Windows codepage compatibility.
- Fixed ENEM 2024+ support: new type parameter for split files ("participantes", "resultados").
- Added SAS datetime parsing for Censo Escolar date columns (dt_*).
- Converted IDEB vl_* columns from character to numeric, handling "-", "ND", and comma decimals.

Changes in version 0.1.2 (2026-02-19)

New features

- Added post-read data validation for all datasets. Errors on empty or corrupted files; warns when expected columns are missing (e.g., score columns for ENEM, UF columns for IDEB/Census) with actionable messages.
- Downloads now show estimated file size before starting (e.g., "downloading 2.3 GB from INEP...") via HTTP HEAD request, with graceful fallback if size is unavailable.
- get_ideb_series() now shows per-year progress indication (e.g., "processing IDEB 2017 (1/4)") and propagates the quiet parameter to inner get_ideb() calls.
- get_enem_itens() now has keep_zip parameter for consistency with get_enem() and get_censo_escolar().
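The estimated-size message with its graceful fallback can be sketched as below. This is a minimal illustration under stated assumptions: format_download_size() is a hypothetical name, not a function exported by the package; the input stands for the Content-Length value of a HEAD response (which may be missing or malformed); and binary units (1 GB = 1024^3 bytes) are an assumption.

```r
# Hypothetical helper: format a Content-Length value (bytes) for display.
# Returns "unknown size" when the header is absent or not a positive number.
format_download_size <- function(content_length) {
  bytes <- suppressWarnings(as.numeric(content_length))
  if (length(bytes) != 1L || is.na(bytes) || bytes <= 0) {
    return("unknown size")
  }
  units <- c("B", "KB", "MB", "GB", "TB")
  # Pick the largest unit that keeps the value at 1 or above, capped at TB.
  idx <- min(floor(log(bytes, base = 1024)) + 1, length(units))
  sprintf("%.1f %s", bytes / 1024^(idx - 1), units[[idx]])
}

# Header values usually arrive as strings; NULL models a missing header.
msg_known <- sprintf("downloading %s from INEP...", format_download_size("2469606195"))
msg_unknown <- sprintf("downloading %s from INEP...", format_download_size(NULL))
```

Returning a placeholder string instead of raising an error keeps the download itself unaffected when the server omits the size.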
Documentation

- Added English README (README.md) as default; Portuguese version renamed to README.pt-br.md with cross-links between both.
- Fixed @param year ranges in documentation to match available_years():
  - get_enem() / get_enem_itens(): 2009-2023 -> 1998-2024
  - get_censo_escolar(): 2007-2024 -> 1995-2024
- Added @family tags to group related functions in help pages (ENEM, IDEB, School Census, cache).
- Added English vignette (getting-started.Rmd).
- Fixed Portuguese accents in README.pt-br.md.

Tests

- Added tests for enem_summary(): statistics calculation, NA handling, grouping by variable, and error on missing score columns.
- Added tests for validate_data(): empty data, few columns, missing expected columns per dataset.

CRAN

- Replaced \donttest with \dontrun in all examples per CRAN request.

Changes in version 0.1.1

Bug fixes

- Fixed set_cache_dir() example that created a directory in the user's home (~/educabR_cache) during CRAN checks. Now uses tempdir() in examples.

Changes in version 0.1.0 (2026-02-03)

First public release.

New features

IDEB

- get_ideb(): Download IDEB data (years 2017, 2019, 2021, 2023).
- get_ideb_series(): Download IDEB historical series across multiple years.
- list_ideb_available(): List available year/stage/level combinations.

ENEM

- get_enem(): Download ENEM microdata (years 1998-2024).
- get_enem_itens(): Download ENEM item response data.
- enem_summary(): Calculate summary statistics for ENEM scores.

School Census

- get_censo_escolar(): Download School Census microdata (years 1995-2024).
- list_censo_files(): List available files in a downloaded census.

Cache management

- set_cache_dir(): Set custom cache directory.
- get_cache_dir(): Get current cache directory.
- clear_cache(): Clear cached files.
- list_cache(): List cached files with metadata.

Utilities

- available_years(): Get available years for each dataset.