A corpus lets you ask Spanish what Spanish does

When learners are unsure about usage, they often turn to a teacher, dictionary, forum, or translation tool. Those can help. But sometimes the best next step is to look at many real examples.

A corpus is a structured collection of texts. A concordancer is a tool that shows search results in context, often with the search phrase centered. Together, they let you investigate patterns such as:

depender de

consistir en

por lo tanto

aunque sea

me parece que

tomar una decisión

The key principle is:

Corpus work does not replace judgment. It gives evidence for judgment.

A corpus shows what occurs in its data, not what is always correct, universal, or appropriate for your context.

Corpus, concordancia, frecuencia, colocación

Core terms:

corpus

structured text collection

concordancia

concordance line / occurrence in context

frecuencia

frequency

colocación

collocation

registro

register

metadatos

metadata

lema

lemma

A concordance line might show:

...depende de la situación...

...dependerá de los resultados...

...no depende sólo del precio...

Reading many such lines helps the learner notice the pattern: standard Spanish uses depender de, not depender en.
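To make the idea of a concordance line concrete, here is a minimal sketch of a keyword-in-context display in Python. The function name kwic, the width parameter, and the sample sentences are assumptions for illustration (the sentences expand the fragments above). A real concordancer searches a large indexed corpus and does much more.

```python
import re

def kwic(texts, phrase, width=30):
    """Return keyword-in-context lines with the search phrase aligned in one column."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for text in texts:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            # Right-align the left context so every hit starts in the same column.
            lines.append(f"{left:>{width}} [{match.group(0)}] {right}")
    return lines

# Illustrative sentences only, not real corpus data.
sample_texts = [
    "Todo depende de la situación y del momento.",
    "El precio final dependerá de los resultados.",
    "La compra no depende sólo del precio.",
]

for line in kwic(sample_texts, "depende"):
    print(line)
```

Searching the literal form depende finds only two of the three sample sentences, because dependerá is a different inflected form. That gap is one reason corpus tools offer lemma search.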

What corpus tools are good for

Corpus tools are especially useful for:

  • preposition patterns,
  • collocations,
  • register differences,
  • common phrase frames,
  • verb complements,
  • regional variants,
  • frequency checks,
  • what comes before and after a connector,
  • natural examples.

Question examples:

Is it depender de or depender en?

Do writers say tomar una decisión or hacer una decisión?

What usually follows por lo tanto?

Is me parece que followed by indicative or subjunctive?

Where does aunque sea appear?

Corpus work turns vague uncertainty into observable data.

Preposition patterns

Spanish prepositions are a strong use case.

depender de

to depend on

consistir en

to consist of/in

insistir en

to insist on

soñar con

to dream about/of

A dictionary may tell you the pattern. A corpus shows examples across contexts.

For consistir en, you may see:

El proyecto consiste en crear una red de apoyo.

La diferencia consiste en que...

El método consiste en aplicar una serie de pruebas.

Now you learn not only the preposition but the structure that follows.
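One rough way to see the structure that follows is to tally the word right after consistir en. The sketch below is a crude heuristic, not a parser: the function names are hypothetical, the ending-based test for infinitives will misfire on nouns such as mujer, and the fourth sentence is an invented example added to show a noun phrase.

```python
import re
from collections import Counter

# A form of consistir, then "en", then the next word (captured).
AFTER_EN = re.compile(r"\bconsist\w*\s+en\s+(\w+)", re.IGNORECASE)

def complement_type(next_word):
    """Crude guess at the structure that follows 'consistir en'."""
    word = next_word.lower()
    if word == "que":
        return "que clause"
    if word.endswith(("ar", "er", "ir")):
        return "infinitive"  # heuristic: also catches nouns such as 'mujer'
    return "noun phrase"

def tally_complements(lines):
    """Count complement types across all matches in the given lines."""
    counts = Counter()
    for line in lines:
        for match in AFTER_EN.finditer(line):
            counts[complement_type(match.group(1))] += 1
    return counts

examples = [
    "El proyecto consiste en crear una red de apoyo.",
    "La diferencia consiste en que...",
    "El método consiste en aplicar una serie de pruebas.",
    "El examen consistió en una prueba escrita.",  # invented example
]
print(tally_complements(examples))
```

A tally like this only suggests where to look. The concordance lines themselves still need to be read.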

Collocation and translationese

Corpus work helps fight translationese. English says “make a decision.” Spanish generally says:

tomar una decisión

adoptar una decisión, in some formal contexts

A search for hacer una decisión may still return hits: rare, nonstandard, translated, or regionally influenced examples. Its frequency and the registers it appears in, however, will look very different from those of tomar una decisión.

This is how corpus evidence helps. It does not only say “wrong.” It shows what native-like usage tends to prefer.

Connectors in context

Search for por lo tanto and you will see formal conclusion patterns:

Por lo tanto, es necesario...

Por lo tanto, no puede afirmarse que...

Los datos son incompletos; por lo tanto, la conclusión debe ser cautelosa.

Search for aunque sea and you may find concessive or minimal-condition uses:

Necesito verlo, aunque sea unos minutos.

Aunque sea difícil, debemos intentarlo.

The phrase is not learned as a translation only. It is learned as a family of contexts.

Register and genre

A corpus may include newspapers, fiction, academic writing, subtitles, legal texts, blogs, or spoken transcripts. Usage varies by genre.

A phrase common in legal Spanish may sound stiff in conversation. A phrase common in social media may be inappropriate in academic writing. Corpus results must be filtered by register when possible.

Useful metadata questions:

Is this example from speech or writing?

Which country?

What year?

What genre?

Is it edited text or user-generated text?

Frequency is not enough

High frequency does not automatically mean appropriate. Low frequency does not automatically mean wrong.

A rare word may be correct but specialized. A common form may be colloquial or nonstandard. A corpus with many newspaper texts may overrepresent journalistic style. A web corpus may contain many errors and translations.

Corpus literacy means asking:

Frequent where, among whom, and for what purpose?
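Part of asking “frequent where?” is remembering that subcorpora differ in size, so raw counts can mislead. A common remedy is to normalize to occurrences per million tokens. The sketch below assumes you can read a hit count and a token total for each subcorpus from your tool. All figures and names are hypothetical.

```python
def per_million(hits, corpus_tokens):
    """Normalize a raw hit count to occurrences per million tokens."""
    if corpus_tokens <= 0:
        raise ValueError("corpus_tokens must be positive")
    return hits / corpus_tokens * 1_000_000

# Hypothetical figures: (raw hits, total tokens in that subcorpus).
subcorpora = {
    "news": (480, 12_000_000),
    "spoken": (90, 1_500_000),
}

for genre, (hits, tokens) in subcorpora.items():
    rate = per_million(hits, tokens)
    print(f"{genre}: {hits} raw hits, {rate:.1f} per million tokens")
```

With these invented numbers, the spoken subcorpus has far fewer raw hits but the higher relative frequency, which is the kind of reversal raw counts hide.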

A corpus workflow for learners

Suppose you want to know whether to write:

depende de tu nivel

or

depende en tu nivel

Search depende de and depende en. Compare counts. Read examples. Check source types. You should find that depende de dominates overwhelmingly in standard Spanish.

Then create learner notes:

depender de + noun phrase

depende de la situación

depende de tu nivel

dependerá de los resultados

The result is not just a correction. It is a reusable pattern.

Example bank walkthrough

Depender de demonstrates preposition pattern learning.

Consistir en shows a verb plus preposition plus infinitive/noun/que clause.

Por lo tanto shows conclusion register and punctuation.

Aunque sea shows concessive and minimum-condition uses.

Me parece que helps test mood and stance in real examples.

Tomar una decisión shows collocation: in a corpus, decisión normally appears with tomar, as in tomar una decisión or related variants.

Corpus-use workflow

  1. Write one focused usage question.
  2. Search the suspected phrase.
  3. Search the alternative phrase.
  4. Read at least 20 real examples when possible.
  5. Check country, genre, and date.
  6. Look for repeated patterns.
  7. Ignore obvious duplicates or machine-translated junk.
  8. Compare with dictionary or grammar source.
  9. Save a pattern with examples.
  10. Use the pattern in your own sentence.

Mini-workshop: investigate one phrase, not the whole language

Open a corpus with one narrow question: tomar una decisión or hacer una decisión? Search both. Read examples. Ignore duplicates and obvious translation noise. Then write a conclusion with a confidence level, for example: “Strong evidence: for standard edited Spanish, I will use tomar una decisión.” Corpus work fails when the question is too broad. It succeeds when the learner makes one practical decision from evidence.

Corpus-result hygiene

Good corpus work includes cleaning. Remove duplicates. Ignore navigation text, boilerplate, and lists when they do not represent natural usage. Check whether examples come from newspapers, fiction, subtitles, academic prose, social media, or legal documents. A phrase common in legal notices may sound strange in conversation; a phrase common in subtitles may be too informal for an article.

Learners should also record negative evidence carefully. If a searched phrase produces very few clean examples, do not simply conclude that it is impossible. It may be rare, regional, specialized, or hard to search. Try related forms, inflections, and context words. Search tomar una decisión, toma de decisiones, and decidir if the question is about decision language.
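If you export hits to a plain list of lines, the two cleaning steps above (dropping duplicates, then checking related forms) can be sketched in a few lines of Python. Everything here is an assumption for illustration: the helper names, the regular expressions, and the sample lines are invented, and the patterns are deliberately loose rather than a full account of Spanish inflection.

```python
import re

def normalize(line):
    """Lowercase and collapse whitespace so near-identical copies compare equal."""
    return re.sub(r"\s+", " ", line).strip().lower()

def drop_duplicates(lines):
    """Keep the first occurrence of each normalized line, preserving order."""
    seen = set()
    kept = []
    for line in lines:
        key = normalize(line)
        if key not in seen:
            seen.add(key)
            kept.append(line)
    return kept

# Related forms named in the text, as loose hypothetical patterns.
RELATED_FORMS = {
    "tomar una decisión": r"\btom\w+ una decisión\b",
    "toma de decisiones": r"\btoma de decisiones\b",
    "decidir": r"\bdecid\w+",
}

def count_forms(lines):
    """Count matches of each related form across the given lines."""
    return {
        label: sum(len(re.findall(rx, line, re.IGNORECASE)) for line in lines)
        for label, rx in RELATED_FORMS.items()
    }

# Invented sample hits; the second differs from the first only in spacing.
raw_hits = [
    "Hay que tomar una decisión pronto.",
    "Hay que  tomar una decisión pronto. ",
    "La toma de decisiones fue lenta.",
    "Decidieron esperar.",
]

clean = drop_duplicates(raw_hits)
print(len(raw_hits), "raw lines,", len(clean), "after cleaning")
print(count_forms(clean))
```

Counting after cleaning matters: a phrase that looks well attested can shrink sharply once repeated copies are removed.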

The final note should include a confidence level: strong evidence, moderate evidence, unclear, or needs native/reference check. This prevents corpus tools from becoming a machine for overconfident answers.

Remediation drill: design a corpus question that can actually be answered

Bad corpus question:

How do Spanish speakers use por?

Better corpus question:

In formal written Spanish, which phrase is more common before a conclusion: por lo tanto or por consiguiente?

Even better:

In academic prose, what verbs commonly follow los resultados sugieren que?

A corpus is not a magic oracle. It answers well-shaped questions. The learner must define phrase, register, region if relevant, and purpose.

After searching, inspect concordance lines and tag patterns. For por lo tanto, you may find conclusion statements. For por consiguiente, you may find more formal or written contexts. For en consecuencia, you may find institutional or argumentative prose. The answer is not just frequency; it is register and usage.

Now write a learner rule:

Por lo tanto is broadly useful for conclusions. Por consiguiente may sound more formal. I should choose based on register, not treat them as identical decoration.

For collocation repair, use corpora to fight translationese. Search the Spanish phrase you are tempted to write and the Spanish phrase you suspect is better. Read examples. Keep three clean model sentences. Then write your own sentence. Corpus study should end in production.

Suggested interactive module: corpus-query worksheet

A strong tool would guide learners through small corpus investigations.

Suggested functions:

  1. Question builder: preposition, collocation, connector, mood, register.
  2. Query pairs: correct candidate versus suspected calque.
  3. Concordance viewer: phrase centered with context.
  4. Metadata filters: country, genre, year, spoken/written.
  5. Pattern extractor: common words before and after.
  6. Noise flag: duplicates, OCR errors, machine translation, quotations.
  7. Learner conclusion field: “I will use X in Y context.”
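As an illustration of function 5, a pattern extractor could start as a simple count of the word immediately before and after each hit. This is a sketch under assumptions: the function names are hypothetical, the tokenizer discards punctuation (so the comma after por lo tanto is not recorded), and the sample lines are the por lo tanto examples from earlier in this text.

```python
import re
from collections import Counter

def tokenize(text):
    """Split into lowercase word tokens, discarding punctuation."""
    return re.findall(r"\w+", text.lower())

def neighbors(lines, phrase):
    """Count the word directly before and directly after each occurrence of phrase."""
    target = tokenize(phrase)
    n = len(target)
    before, after = Counter(), Counter()
    for line in lines:
        tokens = tokenize(line)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == target:
                if i > 0:
                    before[tokens[i - 1]] += 1
                if i + n < len(tokens):
                    after[tokens[i + n]] += 1
    return before, after

lines = [
    "Por lo tanto, es necesario...",
    "Por lo tanto, no puede afirmarse que...",
    "Los datos son incompletos; por lo tanto, la conclusión debe ser cautelosa.",
]

before, after = neighbors(lines, "por lo tanto")
print("before:", before.most_common(3))
print("after:", after.most_common(3))
```

With only three lines nothing repeats, which is itself a reminder that a pattern extractor needs many clean examples before its counts mean anything.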

Applied corpus drill: compare two candidates

Question:

Should I write depende en or depende de?

Search both. Then inspect concordance lines. If depende de appears repeatedly in examples such as depende de la situación, depende del contexto, and depende de cada persona, the pattern is clear. If depende en appears mostly in translations, errors, or special contexts, do not treat raw hits as equal evidence. The corpus answer is not just frequency. It is frequency plus clean examples plus matching context.

Remediation focus: using corpora as evidence without pretending they are oracles

A corpus is powerful because it shows Spanish in use. It is dangerous because learners can mistake search results for final authority. Concordance lines are evidence from a sampled collection, not a complete vote by all speakers. Frequency depends on genre, country, time period, source selection, and search design. Corpus literacy requires humility.

The remediation move is to ask one narrow question at a time. Do not search a corpus to “learn por and para.” Search whether depender is normally followed by de, whether consistir pairs with en, whether tomar una decisión appears more naturally than hacer una decisión, or how por lo tanto behaves in academic prose. A corpus answers good questions better than vague ones.

Common failure modes to repair

  • Trusting raw frequency alone: A common form may be informal, quoted, regional, formulaic, or irrelevant to your target context.
  • Using bad queries: Searching only one inflected form may miss the pattern. Searching too broadly may collect noise.
  • Ignoring metadata: Country, genre, date, and medium matter.
  • Treating one example as permission: One corpus hit can be an error, quotation, joke, foreignism, or old usage.

Before/after: turn a vague corpus question into a usable query

Weak version:

Is this Spanish correct?

Stronger version:

In contemporary written Spanish, does consistir normally take the preposition en before a noun phrase or infinitive? Compare concordance lines for consiste en, consistió en, and consisten en.

The stronger question names register, period, verb, construction, and forms to search.
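One way to hold yourself to that standard is to write the question down as a small record and check which parts are still blank. The sketch below is only an illustration: the class, its field names, and the check are hypothetical, not part of any corpus tool.

```python
from dataclasses import dataclass, field

@dataclass
class CorpusQuestion:
    """One focused usage question; every field name here is hypothetical."""
    register: str
    period: str
    verb: str
    construction: str
    forms: list = field(default_factory=list)

    def missing_parts(self):
        """List the parts that are still blank, so a vague question is easy to spot."""
        missing = [
            name
            for name in ("register", "period", "verb", "construction")
            if not getattr(self, name).strip()
        ]
        if not self.forms:
            missing.append("forms")
        return missing

# "Is this Spanish correct?" names none of the parts.
weak = CorpusQuestion(register="", period="", verb="", construction="")

# The stronger question from the text, written out field by field.
strong = CorpusQuestion(
    register="written",
    period="contemporary",
    verb="consistir",
    construction="en + noun phrase or infinitive",
    forms=["consiste en", "consistió en", "consisten en"],
)

print("weak question is missing:", weak.missing_parts())
print("strong question is missing:", strong.missing_parts())
```

The weak question leaves every part blank, while the stronger one leaves none, so it can go straight to the search box.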

Upgrade workshop: one-question corpus worksheet

  1. Write the exact usage question in one sentence.
  2. Choose the target register: academic, news, spoken, fiction, web, legal, regional.
  3. Search two or three forms, including inflection or nearby words.
  4. Open concordance lines and discard irrelevant hits.
  5. Record the construction pattern, not only the count.
  6. Write a learner conclusion with caution: “In this corpus and register, the dominant pattern is…”

Quality-control checklist

  • Does the corpus include enough examples from the target country?
  • Are you comparing equivalent forms, not one broad search against one narrow search?
  • Do examples come from edited prose or user-generated text?
  • Is the word part of a quotation, title, or foreign-language fragment?
  • Have you checked collocation and preposition, not just existence?

Applied remediation drill: interpret concordance lines instead of counting blindly

Use this sample search summary:

Search question: Which is more natural in edited Spanish, “depende de si” or “depende si”? Concordance lines show frequent “depende de si” in formal writing, while “depende si” appears in informal web comments and regional speech.

A fast but weak reading might say:

Both forms exist, so both are equally good everywhere.

That reading is incomplete. A stronger reading says:

Both forms may appear, but corpus metadata suggests depende de si is safer in edited formal Spanish, while depende si may be informal, regional, or less standard depending on context.

The repair comes from five checks:

  1. The query asks about a specific construction, not a whole verb.
  2. Existence is not the same as register-neutral acceptability.
  3. Edited prose and informal comments should not be weighted equally for formal writing.
  4. Regional patterns may explain some variation.
  5. The learner conclusion should name the target context.

Write the conclusion this way: For formal writing, I will use “depende de si.” I will recognize “depende si” in informal or regional contexts but will not use it as my default in edited prose. Corpus work is successful when it changes a writing decision, not when it merely produces numbers.
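The same reasoning can be sketched as a small table of hits grouped by register, so that the conclusion is tied to a target context rather than to a total. The hit list below is invented to mirror the sample summary above, and the function names are hypothetical. Real metadata labels will depend on the corpus you use.

```python
from collections import Counter

# Hypothetical hits: (phrase found, register taken from corpus metadata).
sample_hits = [
    ("depende de si", "edited prose"),
    ("depende de si", "edited prose"),
    ("depende de si", "web comments"),
    ("depende si", "web comments"),
    ("depende si", "web comments"),
]

def by_register(hits):
    """Group phrase counts under each register label."""
    table = {}
    for phrase, register in hits:
        table.setdefault(register, Counter())[phrase] += 1
    return table

def dominant_form(hits, target_register):
    """Return the most frequent phrase in the target register, or None if unseen."""
    counts = by_register(hits).get(target_register)
    if not counts:
        return None
    return counts.most_common(1)[0][0]

for register, counts in by_register(sample_hits).items():
    print(register, dict(counts))

print("edited prose:", dominant_form(sample_hits, "edited prose"))
print("legal:", dominant_form(sample_hits, "legal"))
```

Returning None for a register with no hits keeps the sketch from implying an answer the data cannot support.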

Final rule

A corpus does not give authority by magic. It gives examples.

Use Spanish corpora and concordancers to investigate depender de, consistir en, por lo tanto, aunque sea, me parece que, and collocations like tomar una decisión. Ask focused questions. Read context. Respect register. Let evidence sharpen your judgment.