get_ideb() has a new signature: get_ideb(level, stage, metric, year, quiet).
The old positional usage get_ideb(year, level, stage) still works with a
deprecation warning, but the year parameter now filters IDEB editions
instead of selecting which file to download.

get_ideb() now returns data in tidy long format instead of wide format.
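To make the wide-to-long change concrete, the sketch below reshapes a toy table with base R. The column names (municipio, ideb_2019, ideb_2021, ano, valor) are made up for illustration and are not the columns get_ideb() returns.

```r
# Toy wide table: one column per IDEB edition (hypothetical column names).
wide <- data.frame(
  municipio = c("A", "B"),
  ideb_2019 = c(5.1, 4.8),
  ideb_2021 = c(5.3, 4.9),
  stringsAsFactors = FALSE
)

# Long format: one row per municipality and edition.
long <- reshape(
  wide,
  direction = "long",
  varying   = c("ideb_2019", "ideb_2021"),
  v.names   = "valor",
  timevar   = "ano",
  times     = c(2019, 2021),
  idvar     = "municipio"
)
rownames(long) <- NULL

# With one row per edition, filtering by year is an ordinary row filter.
long_2021 <- long[long$ano == 2021, ]
```

In this layout a year filter only drops rows, which is presumably why the year parameter can act as an edition filter rather than a file selector.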
Output columns depend on the metric parameter ("indicador", "aprovacao",
"nota", "meta").

get_ideb() now supports 5 geographic levels: "escola", "municipio",
"estado", "regiao", and "brasil" (previously only escola and municipio).

get_ideb() always downloads the most recent IDEB file available, which
contains the full historical series. The year parameter filters editions.

get_ideb_series() is deprecated. Use get_ideb(level, stage, metric) instead.

list_ideb_available() now returns level, stage, and metric columns
(previously returned year, level, stage).

uf parameter has been removed from get_ideb(). Filter the result
with dplyr::filter() instead.

download_inep_file() timeout is now configurable via
options(educabR.download_timeout = N) (seconds; default 600). Raise it
when downloading large microdata (e.g. ENEM participantes at ~1.6 GB)
over a slow link (issue #7).

read_inep_file() now warns before reading files larger than 500 MB
entirely into memory, suggesting n_max or UF filters to reduce memory
pressure. Suppressible with quiet = TRUE; all get_*() callers
propagate their quiet argument (issue #5).

get_ideb() no longer consumes several GB of RAM for school-level reads
(issue #1). The xlsx is now read with column projection: only the vl_*
columns matching the requested metric (and year, when given) are
parsed; the others are skipped at the readxl C++ layer. INEP's NA tokens
("", "-", "ND") are also passed to read_excel(na = ...) so the
missing-value strings never get allocated as R character vectors.
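A minimal sketch of column projection with readxl, assuming the readxl package is installed. readxl's bundled example workbook stands in for an IDEB file, the "Sepal" prefix plays the role of the vl_* columns, and reading the header first to build col_types is an assumption about one way to do it, not a description of educabR's internals.

```r
library(readxl)

# readxl's bundled example workbook stands in for an IDEB xlsx file.
path <- readxl_example("datasets.xlsx")

# Read only the header row to learn the column names.
header <- names(read_excel(path, sheet = "iris", n_max = 0))

# Keep one identifier column and the value columns of interest; mark the
# rest as "skip" so readxl does not return them.
# "Species" and the "Sepal" prefix are stand-ins for an id column and vl_*.
col_types <- ifelse(
  header == "Species", "text",
  ifelse(grepl("^Sepal", header), "numeric", "skip")
)

projected <- read_excel(
  path,
  sheet     = "iris",
  col_types = col_types,
  na        = c("", "-", "ND")  # NA tokens named in the entry above
)
```

Skipped columns never appear in the result, so the reshape step only has to handle the columns that were requested.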
For level = "escola", stage = "anos_iniciais", metric = "indicador",
this cuts the in-memory result from ~133 MB to ~37 MB (4 years) or
~19 MB (1 year), with proportional drops in peak memory during reshape.

read_ideb_excel(),
read_excel_safe()) and the FUNDEB enrollment OData fetcher
(fetch_fundeb_enrollment()) are now normalized to UTF-8 NFC, matching
the behavior already in read_inep_file(). Previously, equality
comparisons against literals such as filter(rede == "Pública") could
silently return zero rows on Windows because the source-file encoding
produced non-canonical strings. The shared helper normalize_utf8_nfc()
is now applied at every read entrypoint so all four code paths agree.
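The failure mode can be reproduced in a few lines. This is a sketch that assumes the stringi package is installed; to_nfc() is a hypothetical stand-in, and the package's own normalize_utf8_nfc() may be implemented differently.

```r
library(stringi)

composed   <- "P\u00fablica"   # "ú" as one code point (NFC)
decomposed <- "Pu\u0301blica"  # "u" followed by a combining acute accent

# Same text on screen, different code points, so equality fails.
same_before <- composed == decomposed

# Hypothetical helper: normalize every character column of a data frame
# to NFC.
to_nfc <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], stri_trans_nfc)
  df
}

escolas <- data.frame(
  rede = c(decomposed, "Privada"),
  n    = c(10L, 5L),
  stringsAsFactors = FALSE
)
fixed   <- to_nfc(escolas)
matches <- fixed[fixed$rede == composed, ]
```

After normalization the comparison against the literal finds the row that the decomposed form would have missed.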
Affects get_ideb(), get_cpc(), get_igc(), get_fundeb_enrollment().

read_excel_safe() (used by get_cpc() and get_igc()) now passes INEP's
missing-value tokens ("", "-", "ND", en/em dashes) to
readxl::read_excel(na = ...) so those cells are loaded as NA instead of
character strings cleaned up post-hoc (issue #4). Previously, a column
whose first rows were all "-" could be inferred as logical and later
numeric values silently dropped. clean_dash_values() remains as a safety
net but is now largely redundant for CPC/IGC.

download_inep_file() now verifies downloaded files before caching them
(issue #3). Three checks run after the bytes hit disk: file size against
the server's Content-Length (1% tolerance, catches truncated downloads),
HTML-masquerade detection on the first 64 bytes (catches INEP maintenance
pages served with HTTP 200), and ZIP magic-bytes (PK\x03\x04) for .zip
destinations (catches proxy corruption). On any failure the corrupt file
is deleted and the user gets a clear error telling them to retry, instead
of a cryptic readxl / readr failure on the next call.

validate_year() now rejects vectors and non-numeric input with a clear
error pointing at purrr::map_dfr() for multi-year composition
(issue #2). Previously, passing c(2017, 2019) to any of the 13
affected getters (get_cpc, get_idd, get_igc, get_capes,
get_saeb, get_enem, get_enem_itens, get_enade, get_encceja,
get_fundeb_distribution, get_fundeb_enrollment, get_censo_escolar,
get_censo_superior) hit either a cryptic length > 1 error (R ≥ 4.2)
or silently used only the first element (R < 4.2). get_ideb() is
unaffected; it intentionally accepts year vectors.

extract_zip() cleaned up: removed dead if (TRUE) branch and an
unreachable cli_abort(); the muffle on extraction warnings was
tightened from the broad erro|error pattern to the two specific
messages that motivated it (issue #6).

RoxygenNote bumped to 8.0.0 and man/*.Rd regenerated; systemfonts
and textshaping declared in Suggests: to silence the cosmetic
R CMD check NOTE about packages pulled transitively by pkgdown.

R CMD check warnings cleared: em-dashes in cli_abort() message strings
in R/utils-download.R are now written with Unicode escapes (R requires
ASCII-only in code strings; comments are exempt).

available_years() now dynamically discovers available years by querying
data sources (HEAD requests for INEP, OData queries for FNDE). Results are
cached per session. Falls back to a hardcoded list when offline.

available_years() now accepts "fundeb_enrollment" as a separate dataset
name. Previously, "fundeb" was shared between distribution and enrollment.

Code columns (CO_*, CD_*) are now read as character instead of numeric
across all datasets. This prevents loss of leading zeros in codes like
municipality codes, course codes, and institution codes.

get_enade() failing for 9 of 19 available years. INEP uses
inconsistent URLs for ENADE: _LGPD suffix for 2012-2019, .rar format
for 2022. Added hardcoded URL map (enade_urls) with all 19 correct URLs.

get_fundeb_enrollment() accepting years with no data in the FNDE
API. The API currently only has data for 2017-2018.

clear_cache() failing to delete files on Windows when they were
memory-mapped by readr. Now deletes entire directories and warns about
locked files.

.rar archive extraction support via 7-Zip. find_7z() searches
common Windows install paths when 7z is not in PATH.

strip_diacriticals() internal helper for encoding-safe text matching.

read_inep_file() now auto-detects code columns (CO_*, CD_*) from the
file header and reads them as character. No user action required.

read_ideb_excel() and read_excel_safe() (CPC/IGC) now convert code
columns to character after reading.

get_fundeb_distribution(): Download FUNDEB resource distribution data
(years 2007-2026). Reads all sheets from STN Excel files and returns
tidy long-format data with monthly transfer amounts by state, funding
source, destination (states/municipalities), and table type
(fundeb/adjustment).

get_fundeb_enrollment(): Download FUNDEB enrollment data.
Fetches from FNDE OData API with automatic pagination. Results cached as CSV.

uf, source (FPE, FPM, ICMS, etc.),
and destination ("uf" or "municipio").

https://www.tesourotransparente.gov.br)
and FNDE (https://www.fnde.gov.br).

get_capes(): Download CAPES graduate education data (years 2013-2024).

"programas"), students ("discentes"),
faculty ("docentes"), courses ("cursos"), and theses/dissertations catalog ("catalogo").

https://dadosabertos.capes.gov.br).

get_cpc(): Download CPC data (years 2007-2019, 2021-2023; no 2020 edition).

readxl package.

get_igc(): Download IGC data (years 2007-2019, 2021-2023; no 2020 edition).

read_excel_safe(): Internal helper to read Excel files with error handling.

get_enem_escola(): Download ENEM results aggregated by school (2005-2015).

get_idd(): Download IDD microdata (years 2014-2019, 2021-2023; no 2020 edition).

extract_archive() utility.

get_encceja(): Download ENCCEJA microdata (years 2014-2024).

get_enade(): Download ENADE microdata.

get_censo_superior(): Download Higher Education Census microdata (years 2009-2024).

"ies"), courses ("cursos"), students ("alunos"), and faculty ("docentes").

list_censo_superior_files(): List available files in a downloaded census.

uf parameter.

get_saeb(): Download SAEB microdata (years 2011, 2013, 2015, 2017, 2019, 2021, 2023).

"aluno"), school ("escola"), principal ("diretor"), and teacher ("professor") questionnaires.

level parameter.

iconv() instead of validEnc().

"Latin-1" encoding name to "latin1" for Windows codepage compatibility.

type parameter for split files ("participantes", "resultados").

dt_*).

vl_* columns from character to numeric, handling "-", "ND", and comma decimals.

get_ideb_series() now shows per-year progress indication (e.g., "processing IDEB 2017 (1/4)") and propagates the quiet parameter to inner get_ideb() calls.

get_enem_itens() now has keep_zip parameter for consistency with get_enem() and get_censo_escolar().

README.md) as default; Portuguese version renamed to README.pt-br.md with cross-links between both.

@param year ranges in documentation to match available_years():
get_enem() / get_enem_itens(): 2009-2023 -> 1998-2024
get_censo_escolar(): 2007-2024 -> 1995-2024

@family tags to group related functions in help pages (ENEM, IDEB, School Census, cache).

getting-started.Rmd).

README.pt-br.md.

enem_summary(): statistics calculation, NA handling, grouping by variable, and error on missing score columns.

validate_data(): empty data, few columns, missing expected columns per dataset.

\donttest with \dontrun in all examples per CRAN request.

set_cache_dir() example that created a directory in the user's home (~/educabR_cache) during CRAN checks. Now uses tempdir() in examples.

First public release.
get_ideb(): Download IDEB data (years 2017, 2019, 2021, 2023).

get_ideb_series(): Download IDEB historical series across multiple years.

list_ideb_available(): List available year/stage/level combinations.

get_enem(): Download ENEM microdata (years 1998-2024).

get_enem_itens(): Download ENEM item response data.

enem_summary(): Calculate summary statistics for ENEM scores.

get_censo_escolar(): Download School Census microdata (years 1995-2024).

list_censo_files(): List available files in a downloaded census.

set_cache_dir(): Set custom cache directory.

get_cache_dir(): Get current cache directory.

clear_cache(): Clear cached files.

list_cache(): List cached files with metadata.

available_years(): Get available years for each dataset.
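As a closing illustration of the three download checks described in the download_inep_file() entry (issue #3), here is a base R sketch. check_download() and its arguments are hypothetical and this is not educabR's implementation; the expected size is passed in by hand rather than read from a Content-Length header.

```r
# Hypothetical helper: returns "ok" or a short reason describing why the
# file should not be cached.
check_download <- function(path, expected_size = NA_real_, tolerance = 0.01) {
  actual <- file.size(path)
  if (is.na(actual)) return("missing file")

  # 1. Size against the server-reported length (catches truncation).
  if (!is.na(expected_size) && expected_size > 0 &&
      abs(actual - expected_size) > tolerance * expected_size) {
    return("size mismatch")
  }

  head_bytes <- readBin(path, what = "raw", n = 64L)

  # 2. HTML served in place of data: inspect printable ASCII only.
  codes <- as.integer(head_bytes)
  ascii <- rawToChar(head_bytes[codes >= 32L & codes <= 126L])
  if (grepl("<!doctype html|<html", tolower(ascii))) return("html page")

  # 3. ZIP magic bytes (PK\x03\x04) for .zip destinations.
  if (grepl("\\.zip$", path, ignore.case = TRUE)) {
    magic <- as.raw(c(0x50, 0x4b, 0x03, 0x04))
    if (length(head_bytes) < 4L || !identical(head_bytes[1:4], magic)) {
      return("not a zip archive")
    }
  }
  "ok"
}

# Usage: a maintenance page saved under a .zip name is rejected and removed.
fake_zip <- tempfile(fileext = ".zip")
writeLines("<!DOCTYPE html><html><body>maintenance</body></html>", fake_zip)
status <- check_download(fake_zip)
if (status != "ok") unlink(fake_zip)
```

Running the checks before the file enters the cache means a bad download is discarded once, instead of failing on every later read.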