Learning a new language is most efficient when you focus on the words you will actually encounter. Corpus studies consistently show that a small number of high-frequency words accounts for the vast majority of everyday text. In Turkish, roughly 1,000 words cover about 80% of what you read in the news.
This project was built to answer a simple question: which Turkish words should a learner study first? Instead of relying on textbook vocabulary lists, it analyzes real Turkish news articles to produce a frequency-ranked word list grounded in how the language is actually used today.
The dictionary is generated by a three-stage pipeline that scrapes, processes, and serves Turkish word frequency data.
1. Scraping. A Python scraper fetches articles from Turkish news RSS feeds (such as TRT Haber), extracts the body text of each article, and stores it locally.
2. Lemmatization. Turkish is an agglutinative language — a single root word can take dozens of suffixes. The analyzer uses Zeyrek, a morphological analyzer for Turkish, to reduce inflected forms back to their dictionary lemmas. A stemmer serves as a fallback for words Zeyrek does not recognize.
3. Frequency analysis. The pipeline counts occurrences of each lemma, ranks them by frequency, and calculates cumulative coverage percentages. It also filters noise words (connectors, particles) to keep the list focused on meaningful vocabulary.
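The scraping step (1) can be sketched with nothing but the standard library. The feed content and helper name below are illustrative — the real scraper would fetch a live feed (e.g. with `urllib.request` or a library like `feedparser`) rather than parse an inline sample:

```python
import xml.etree.ElementTree as ET

# Tiny inline RSS 2.0 sample standing in for a real feed such as TRT Haber's.
SAMPLE_RSS = """<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Örnek Haber</title>
    <item>
      <title>Ekonomi haberleri</title>
      <link>https://example.com/haber/1</link>
    </item>
    <item>
      <title>Spor haberleri</title>
      <link>https://example.com/haber/2</link>
    </item>
  </channel>
</rss>"""

def extract_article_links(rss_xml: str) -> list[str]:
    """Parse an RSS 2.0 document and return the <link> of every <item>."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("link") for item in root.iter("item")]

links = extract_article_links(SAMPLE_RSS)
# Each link would then be fetched, its article body extracted from the
# HTML, and the text saved to disk for the next stage.
```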
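The lemmatization step (2) might be structured roughly as below. Zeyrek's `MorphAnalyzer` is the real library entry point; the suffix list and `fallback_stem` helper are deliberately toy stand-ins for the project's actual fallback stemmer, shown only to illustrate the "Zeyrek first, stemmer as fallback" shape:

```python
# Try Zeyrek first; fall back to a toy suffix stripper if it is
# unavailable or fails to initialise.
try:
    from zeyrek import MorphAnalyzer
    _analyzer = MorphAnalyzer()
except Exception:  # zeyrek not installed, or its data failed to load
    _analyzer = None

# Illustrative, far-from-complete Turkish suffix list (hypothetical).
_TOY_SUFFIXES = ("lerinde", "larında", "ler", "lar", "de", "da")

def fallback_stem(word: str) -> str:
    """Naive stemmer: strip the first matching suffix, once."""
    for suffix in _TOY_SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)]
    return word

def lemmatize(word: str) -> str:
    """Return a dictionary lemma for an inflected Turkish word."""
    if _analyzer is not None:
        # lemmatize() returns a list of (word, [lemma, ...]) tuples.
        results = _analyzer.lemmatize(word)
        if results and results[0][1]:
            return results[0][1][0].lower()
    return fallback_stem(word)
```

The real analyzer also has to pick among multiple candidate lemmas when Zeyrek returns several parses; the sketch simply takes the first.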
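The counting and coverage step (3) reduces to a few lines with `collections.Counter`. The stopword set here is a small illustrative sample, not the project's actual noise-word list:

```python
from collections import Counter

# A few illustrative Turkish noise words (connectors, particles).
STOPWORDS = {"ve", "bir", "bu", "da", "de", "ile", "için"}

def rank_with_coverage(lemmas: list[str]) -> list[dict]:
    """Rank lemmas by frequency and attach cumulative coverage (%)."""
    counts = Counter(l for l in lemmas if l not in STOPWORDS)
    total = sum(counts.values())
    rows, running = [], 0
    for rank, (lemma, count) in enumerate(counts.most_common(), start=1):
        running += count
        rows.append({
            "rank": rank,
            "lemma": lemma,
            "count": count,
            "coverage": round(100 * running / total, 1),
        })
    return rows

rows = rank_with_coverage(["ev", "ev", "okul", "ve", "ev", "okul", "kitap"])
# "ve" is filtered out; "ev" ranks first with 50% cumulative coverage.
```

The cumulative coverage column is what lets a learner see, for instance, how far down the list they need to study to understand a given share of news text.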
The resulting frequency list is served as a static JSON file and rendered by a vanilla JavaScript frontend with real-time search filtering.
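Because everything is precomputed, the artifact the frontend consumes can be a single flat JSON array. The field names below are an assumed schema for illustration, not necessarily the project's actual ones:

```python
import json

# Hypothetical output schema — field names are illustrative.
entries = [
    {"rank": 1, "lemma": "ev", "count": 3, "coverage": 50.0},
    {"rank": 2, "lemma": "okul", "count": 2, "coverage": 83.3},
]

# ensure_ascii=False keeps Turkish characters (ç, ğ, ş, ...) readable.
payload = json.dumps(entries, ensure_ascii=False, indent=2)
```

The frontend can then fetch this file once and filter the array in memory as the user types, so no server-side search is needed.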
The full source code — including the scraper, analyzer, and web frontend — is available on GitHub.