A corpus is not a magic authority

A serious learner eventually asks questions that a textbook cannot answer cleanly.

Which is more common: por eso or por lo tanto? Is coger safe everywhere? Do people actually say he comido or comí more? Does quizá take subjunctive or indicative?

A corpus can help. A corpus is a large, searchable collection of real texts or speech samples with metadata such as country, date, genre, or source type.

But a corpus is not a magic authority. It does not tell you automatically what you should say. It shows evidence, and evidence must be interpreted.

The key principle is:

Corpus thinking means asking narrow usage questions, reading examples, and turning patterns into cautious learner rules.

Corpus, concordance, frequency, register, metadata

A corpus is the collection.

A concordance is a display of examples containing your search term, often with words before and after it.

Frequency tells you how often something appears.

Register tells you the kind of language: conversation, fiction, news, academic prose, legal text, social media, institutional writing, and so on.

Metadata tells you where an example comes from: country, date, genre, speaker, publication, or source.

These concepts matter because raw counts can mislead.

If a phrase appears often in legal texts, that does not mean it is normal in conversation. If a form appears often in Spain, that does not mean it is equally common in Mexico. If a corpus has more newspapers than speech, it will overrepresent journalistic style.
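
One way to keep raw counts from misleading is to normalize them by the size of each subcorpus. The sketch below is only an illustration: the function name per_million and all the numbers are invented for the example and do not come from any real corpus.

```python
def per_million(hits: int, corpus_tokens: int) -> float:
    """Return occurrences per million tokens for one subcorpus."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return hits * 1_000_000 / corpus_tokens

# Invented toy numbers, for illustration only (not real corpus data).
subcorpora = {
    "news": {"hits": 900, "tokens": 30_000_000},
    "conversation": {"hits": 200, "tokens": 2_000_000},
}

rates = {
    name: per_million(data["hits"], data["tokens"])
    for name, data in subcorpora.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f} per million tokens")
```

In this toy case the news subcorpus has more raw hits, but the phrase is relatively more frequent in conversation once subcorpus size is taken into account. A normalized rate is still only a count; it does not replace reading the examples.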

Ask narrow questions

Bad corpus question:

Is por eso better than por lo tanto?

Better question:

In contemporary written news Spanish, is por eso or por lo tanto more common in causal transitions?

Even better:

In opinion columns, does por lo tanto appear more often in formal argument than por eso?

A corpus rewards precise questions.

For learners, a useful question has:

  1. a specific expression,
  2. a comparison term,
  3. a register,
  4. a region or pan-Hispanic scope,
  5. a practical output.

The output should be a learner rule, not a universal law.

Read examples, not only counts

Suppose you compare:

por eso

and:

por lo tanto

A frequency count may show that both are common. But examples reveal usage.

Por eso often points back to a cause and explains a result in a direct way.

Estaba lloviendo. Por eso no salimos.

Por lo tanto often sounds more formal, argumentative, or inferential.

Los datos son insuficientes; por lo tanto, no es posible confirmar la hipótesis.

The count alone does not teach this difference. The examples do.

Learner action: always read at least 20 examples before turning a corpus result into advice.
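
If your corpus tool has no concordance view, or you are working with your own collection of texts, a simple keyword-in-context display is easy to sketch. The version below is a minimal illustration that assumes plain text in a single string; the function name kwic and the width parameter are hypothetical, not part of any corpus tool.

```python
import re

def kwic(text: str, phrase: str, width: int = 30) -> list[str]:
    """Return one keyword-in-context line per occurrence of phrase."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Right-align the left context so the search term lines up in a column.
        lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# The two example sentences used earlier in this section.
sample = (
    "Estaba lloviendo. Por eso no salimos. "
    "Los datos son insuficientes; por lo tanto, "
    "no es posible confirmar la hipótesis."
)

for phrase in ("por eso", "por lo tanto"):
    for line in kwic(sample, phrase):
        print(line)
```

Aligning the search term in one column makes it easier to scan what comes before and after it, which is the point of reading examples rather than counts.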

Regional vocabulary needs corpus humility

Consider:

tomar / coger / agarrar

A learner wants to know which means “take.” The answer depends on region, object, context, and taboo meanings.

Coger is normal in Spain for many “take” contexts, but can be vulgar in parts of Latin America. Tomar is widely useful for drinks, transport in many regions, medicine, decisions, and some objects. Agarrar is common in many American varieties for grasping or taking.

A corpus can show distributions, but only if the corpus is regionally balanced and tagged well.

Learner action: do not use a global count to erase regional meaning.

Tense variation: he comido and comí

The contrast between he comido and comí is not only grammar. It is also region, discourse, and time framing.

In much of Spain, the present perfect is common for recent past connected to the present day:

Hoy he comido temprano.

In many American varieties, the preterite may be more common in the same context:

Hoy comí temprano.

A corpus can help you see broad tendencies, but you must check region and genre. A corpus of Spanish newspapers from Spain will not answer how people speak in Lima or Bogotá.

Learner action: define your target variety before interpreting tense frequencies.

Mood variation: quizá venga and quizá viene

Spanish allows both indicative and subjunctive after uncertainty markers such as quizá/quizás, with meaning and stance differences.

Quizá viene.

Maybe he/she is coming. The speaker presents it as somewhat likely, or treats it as a real possibility.

Quizá venga.

Maybe he/she will come. The speaker marks uncertainty more strongly or frames it as less asserted.

A corpus search can show both forms. It cannot replace grammatical analysis. The question is not merely which is more common; the question is what stance each form signals in context.

Common corpus pitfalls

Pitfall 1: Small samples

Ten examples are not enough for a broad conclusion.

Pitfall 2: Genre bias

A corpus full of newspapers makes formal public language look more common than it is in conversation.

Pitfall 3: Regional bias

A pan-Hispanic corpus may still have uneven representation.

Pitfall 4: Search-form blindness

Searching only for dámelo may miss dámela, dármelos, the nonstandard spacing da me lo, and other related forms.
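
A regular expression can cover several of these variants in one search. The pattern below is a deliberately loose sketch for this one family of forms, not a general clitic analyzer. It also accepts unaccented spellings such as damelo, and it may match irrelevant strings, so the hits still need to be read.

```python
import re
import unicodedata

# Loose pattern: da/dá/dar/dár + me + lo/la/los/las, with optional spaces.
VARIANTS = re.compile(r"\bd[aá]r?\s?me\s?l[oa]s?\b", re.IGNORECASE)

texts = [
    "Dámelo ahora.",
    "da me lo ya",
    "Quiere dármelos mañana.",
    "Dámela, por favor.",
    "Me lo dio ayer.",  # should not match
]

hits = []
for text in texts:
    # Normalize to composed characters so that "á" is a single code point.
    normalized = unicodedata.normalize("NFC", text)
    hits.extend(m.group(0) for m in VARIANTS.finditer(normalized))

print(hits)
```

Where a corpus offers lemma search, that is usually a better route than a hand-written pattern; a pattern like this is a fallback for plain-text search.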

Pitfall 5: Homographs

como can be a verb form or a conjunction. sé and se are different words, so check whether your search tool treats accent marks as significant.
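
Some search tools ignore accent marks by default. The sketch below shows what that does to a pair like sé and se. The helper strip_accents is a hypothetical illustration of accent folding built on Python's standard unicodedata module; real corpus interfaces may behave differently.

```python
import unicodedata

def strip_accents(s: str) -> str:
    """Remove combining marks. Note: this also turns ñ into n."""
    decomposed = unicodedata.normalize("NFD", s)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Crude tokenization by whitespace, enough for this small demonstration.
tokens = "No sé si se fue. Sé que se quedó.".lower().split()

# Accent-sensitive search: only the verb form "sé".
exact_hits = tokens.count("sé")

# Accent-insensitive search: "sé" and the pronoun "se" are counted together.
folded_hits = [strip_accents(t) for t in tokens].count("se")

print(exact_hits, folded_hits)
```

The accent-insensitive count is twice as large here because it mixes two different words. If a tool folds accents, the irrelevant hits have to be removed by reading.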

Pitfall 6: Confusing frequency with suitability

A frequent form may be informal, regional, vulgar, bureaucratic, or inappropriate for your goal.

From corpus evidence to learner rule

A responsible workflow:

  1. State the question.
  2. Choose the corpus.
  3. Limit by region/register if possible.
  4. Search exact forms and variants.
  5. Read examples.
  6. Remove irrelevant hits.
  7. Compare contexts.
  8. Form a cautious rule.
  9. Test the rule against another source.
  10. Save examples.

Example learner rule:

In formal argumentative writing, por lo tanto often sounds more inferential and explicit than por eso. In everyday explanation, por eso is often more natural.

That is useful and modest.

Example bank walkthrough

por eso / por lo tanto

Both express consequence, but register and argumentative force differ.

Learner action: compare examples in conversation, essays, and news.

tomar / coger / agarrar

Regional and domain-sensitive “take” verbs.

Learner action: never interpret global frequency without regional notes.

he comido / comí

Present perfect and preterite distribution varies by region and time framing.

Learner action: choose examples from your target variety.

quizá venga / quizá viene

Subjunctive and indicative both occur, with differences in mood and stance.

Learner action: read context, not just the word after quizá.

Remediation notes: corpus evidence needs a question, a filter, and humility

Corpus thinking needs one major remediation: a corpus does not answer vague questions well. “Which word is better?” is usually too broad. A good corpus question includes region, register, structure, and comparison.

Weak question:

Is coger used in Spanish?

Better question:

In contemporary written news from Mexico, how often does coger appear with transport meanings compared with tomar?

Even better:

In spoken interviews from Spain and Mexico, what objects appear after coger, tomar, and agarrar?

This makes corpus work teach usage instead of confirming prejudice.

Raw frequency needs its own warning. A form may appear frequently because the corpus contains many texts from one country, one genre, one period, or one topic. Legal texts overrepresent legal vocabulary. News texts overrepresent politics, crime, sports, and official speech. Fiction overrepresents narrative tense and dialogue. Social media, when included, may overrepresent slang, spelling variation, and short-form discourse.

Concordance reading matters more than the count. Twenty lines of examples can teach what a number cannot: what subjects occur, what objects follow, what prepositions appear, whether the phrase is formal, whether it is quoted speech, whether it is regional, and whether the meaning is literal or idiomatic.

A responsible corpus workflow:

  1. Define the exact comparison.
  2. Choose corpus filters: country, period, oral/written, genre.
  3. Search forms and, when possible, lemmas.
  4. Read concordance lines in context.
  5. Remove irrelevant meanings.
  6. Check whether examples cluster in one region or genre.
  7. Convert the evidence into a learner rule with limits.
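
Step 6 can be partly automated once each hit carries metadata. The sketch below assumes each hit is a dictionary with country and genre fields. The threshold of 20 hits echoes the reading advice given earlier; the 70% share is an arbitrary assumption, and the function names and toy data are invented for illustration.

```python
from collections import Counter

def dominant_share(labels: list[str]) -> tuple[str, float]:
    """Return the most common label and its share of all labels."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)

def caution_notes(hits: list[dict], min_hits: int = 20,
                  max_share: float = 0.7) -> list[str]:
    """List reasons to distrust a result before turning it into a rule."""
    if not hits:
        return ["no hits"]
    notes = []
    if len(hits) < min_hits:
        notes.append(f"only {len(hits)} hits; fewer than {min_hits}")
    for field in ("country", "genre"):
        label, share = dominant_share([h[field] for h in hits])
        if share > max_share:
            notes.append(f"{share:.0%} of hits have {field}={label}")
    return notes

# Invented toy hits: eight from Spain, two from Mexico, all from news.
toy_hits = (
    [{"country": "ES", "genre": "news"}] * 8
    + [{"country": "MX", "genre": "news"}] * 2
)

for note in caution_notes(toy_hits):
    print(note)
```

A check like this does not decide anything. It only flags results that deserve a second look before they become a learner rule.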

Example learner rule after corpus work:

Tomar una decisión is a broadly useful collocation in formal and informal Spanish. Hacer una decisión appears mainly under English influence or in nonstandard contexts, so I will avoid it in careful writing.

The final humility point: corpora show attested usage; they do not automatically tell you what is appropriate for your situation. A rare form may be perfect in poetry. A frequent form may be too informal for a legal notice. A corpus is a tool for disciplined curiosity, not a replacement for judgment.

Repair rule:

A corpus answer is only as good as the question, filters, and examples behind it.

Suggested interactive module: beginner corpus query worksheet

A strong tool for this article would make corpus use less reckless.

Suggested functions:

  1. Question builder: expression, comparison, region, register, goal.
  2. Variant generator: conjugated, accented, plural, clitic, and spelling variants.
  3. Concordance reader: show examples with surrounding context.
  4. Noise filter: mark irrelevant hits and homographs.
  5. Metadata panel: country, genre, date, source.
  6. Count caution: warning when sample size or bias is high.
  7. Rule composer: turn evidence into a modest learner rule.

Final rule

A corpus does not replace judgment. It gives evidence.

Ask narrow questions. Read real examples. Check region and register. Beware small samples, genre bias, and misleading counts.

Corpus thinking is not about sounding scientific. It is about becoming harder to fool.