Have you ever made your own language?
The idea of making an entire language might sound absurd. After all, languages coalesce over thousands of years, from the contributions of millions of people. They can't just be cobbled together by a single person, can they?
But sure enough, there are plenty of people who do exactly that. Perhaps the most famous was J.R.R. Tolkien, author of The Lord of the Rings, who said that he created Middle-earth to give life and depth to his constructed languages, and not the other way around. A more recent conlang celebrity is David J. Peterson, who created languages for Game of Thrones and Dune, among others.
This project aims to distill the many choices that make up language construction into a friendly UI that is accessible even to people with no formal linguistic knowledge. It also handles the problem of organizing and storing all the information that makes up a language, without the need for spreadsheets or messy notebooks.
The basic roadmap is this:
- Construct the phonology, the sound system, of the language.
  - Which consonants and vowels will your language have?
  - In what order can those phonemes appear? What consonant clusters and diphthongs are allowed in your language?
  - How have sound changes altered the phonology of the language over the (simulated) centuries of usage?
- Configure the morphology, or word-level grammar, of the language.
  - What sorts of grammatical categories, like tense or plurality, does the language mark?
  - How are they marked? Suffixes, prefixes, something else?
  - Can new words be formed by compounding other words together, and if so, how?
- Decide the syntax, or word order, of the language.
  - Do adjectives come before or after nouns? Prepositions or postpositions?
  - Are any grammatical categories marked through word order rather than morphology?
- Create the lexicon, the words of the language.
  - Which concepts does the language have words for, and which are derived from other words?
  - Does the language borrow words from other languages?
How it's made
The application has gone through two major iterations.
The Language Generator (v1)
This first iteration used Vue.js on the frontend, with templates written in Pug, basic styles drawn from Bootstrap, and logic written in CoffeeScript.
I used both Pug and CoffeeScript at my first job as a developer, and I still appreciate their readability, though I haven't used them recently. One of the issues I ran into was a lack of IDE support. It's easy to underestimate the importance of autocomplete and syntax highlighting until you have to go without them.
While CoffeeScript is mostly forgotten today, that is largely because some of its best features were adopted into JavaScript, including the spread operator (...), array and object destructuring, arrow functions, and others. Perhaps TypeScript will someday suffer the same fate...
The backend was written in Go, with PostgreSQL as the database. For development, I ran the backend and database together with Docker Compose.
The algorithm for generating words stayed the same between both iterations, and is discussed in detail below.
Ultimately, I abandoned this iteration for a few reasons. On the front-end, I was starting to grow tired of CoffeeScript. The lack of IDE support and the lack of explicit types meant I was spending too much time debugging. On the back-end, CRUD operations were getting too verbose. The custom data structures that the project required, paired with the need to marshal and unmarshal JSON with each operation, had me craving the simplicity of a Node API and MongoDB.
Conlang Workshop (v2)
For the rewrite, I decided to streamline CRUD operations by passing them through a simple Node API and into a Mongo database with virtually no intermediate steps. This Node API served as a gateway, handling user authentication and CRUD while saving the heavier operations for a much simpler Go back-end that I call the "language service".
The language service communicates only with the Node gateway, as does the database. Again, I used Docker Compose to run the three components (gateway, language service, and database) together.
The front-end was an experiment outside of the usual Single-Page Application architecture. It uses Astro to implement the Islands architecture, with the dynamic components written in Svelte. For basic styles I used Bulma.
This experiment with Islands was interesting, and while I think a lot of projects would be better suited to an Islands architecture than to a SPA, I don't think this was one of those projects. Something I often struggle with in front-end development is finding the right balance between local state, global state, and persistent data. I think the Islands architecture, by requiring a multi-page application, led me to lean too heavily on persistent state. Each stage of the application should really be completed in sequence, but because I saved progress for every little sub-entity, I had to handle the possibility that users would skip steps, do them out of order, or revisit them. That started to get messy and error-prone. Of course, the same mistake could have been made with a SPA, but I think it was trying to conform to an Islands architecture that started me down that slope in the first place.
Another factor in ending v2 was that I was starting to encounter the limitations of the word-generation algorithm, which I discuss below.
The Word-Generation Algorithm
This algorithm was my favorite part of the project to design and implement, and it's really the core of the whole application. The algorithm is essentially a Markov chain constructed according to a sonority hierarchy.
In the UI, the user configures a sonority hierarchy for their language. Each phoneme sits at some level of sonority, and a syllable generally has low-sonority phonemes at its beginning and/or end, with high-sonority phonemes sandwiched in between. A sample sonority hierarchy might look like this, from highest to lowest sonority:
- Vowels - /a/, /e/, /o/
- Approximants - /r/, /w/
- Stops - /p/, /t/, /k/
- /s/
The algorithm picks a (weighted) random phoneme from one of the lower tiers, then forms the rest of the syllable by travelling up the sonority hierarchy, possibly skipping tiers. (For the end of the syllable, it turns around and traverses a similar sonority hierarchy in reverse order, from highest sonority to lowest.)
For example, the algorithm may pick the lowest tier, starting with the phoneme /s/, then pick /k/ from the tier above it, then skip the second tier, and pick /a/ from the top tier. The syllable at this point in the algorithm is then /ska/. The next iteration, generating a new syllable, may start with the third tier, picking /p/, then /r/ from the second tier, and finally /o/, resulting in /pro/.
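The climb described above can be sketched in a few lines. This is a minimal illustration in Python rather than the project's actual Go code, and it makes several assumptions: tiers and phonemes are chosen uniformly instead of with weights, the `skip_chance` parameter is hypothetical, and only the start of the syllable (up to the vowel) is built, leaving out the descent at the end of the syllable.

```python
import random

# Tiers from highest to lowest sonority, mirroring the sample hierarchy.
HIERARCHY = [
    ["a", "e", "o"],  # vowels
    ["r", "w"],       # approximants
    ["p", "t", "k"],  # stops
    ["s"],            # /s/
]

def generate_syllable(hierarchy, rng, skip_chance=0.5):
    """Pick a phoneme from a lower tier, then climb to the vowel tier."""
    top = 0  # index of the most sonorous tier
    # Start from any tier below the vowels (uniform here; an assumption).
    tier = rng.randrange(1, len(hierarchy))
    phonemes = []
    while tier > top:
        phonemes.append(rng.choice(hierarchy[tier]))
        tier -= 1
        # Possibly skip intermediate tiers, but never the vowel tier.
        while tier > top and rng.random() < skip_chance:
            tier -= 1
    phonemes.append(rng.choice(hierarchy[top]))
    return phonemes

def generate_word(hierarchy, rng, syllables=2):
    """Concatenate several syllables into one word."""
    word = []
    for _ in range(syllables):
        word.extend(generate_syllable(hierarchy, rng))
    return "".join(word)

if __name__ == "__main__":
    rng = random.Random()
    print([generate_word(HIERARCHY, rng) for _ in range(5)])
```

Every syllable this produces rises strictly in sonority and ends on a vowel, so sequences such as /ska/ and /pro/ are possible while something like /rsa/ is not.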
Without adjusting the probabilities of certain sequences, the model produces words that are pronounceable but don't really sound natural. To fix this, I wrote a Python script to scrape phonemic transcriptions from Wiktionary for a variety of languages, and fiddled with the data for a while to see what patterns I could find. With my findings I was able to apply some adjustments to the algorithm and produce far more naturalistic output.
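One simple way to look for such patterns is to count how often each phoneme follows another and turn the counts into probabilities. The sketch below is only an illustration of that idea, not the analysis actually performed for the project; `transition_weights` is a hypothetical helper, and `SAMPLE` is toy placeholder data rather than real Wiktionary transcriptions.

```python
from collections import Counter, defaultdict

def transition_weights(words):
    """Estimate P(next phoneme | previous phoneme) from transcribed words."""
    counts = defaultdict(Counter)
    for phonemes in words:
        # Walk over each adjacent pair of phonemes in the word.
        for prev, nxt in zip(phonemes, phonemes[1:]):
            counts[prev][nxt] += 1
    # Normalise each row of counts so it sums to 1.
    return {
        prev: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for prev, followers in counts.items()
    }

# Toy placeholder data, NOT real scraped transcriptions.
SAMPLE = [
    ["s", "k", "a"],
    ["s", "t", "a"],
    ["s", "k", "o"],
    ["p", "r", "o"],
]

weights = transition_weights(SAMPLE)

if __name__ == "__main__":
    for prev, followers in weights.items():
        print(prev, followers)
```

Weights estimated this way could then replace the uniform choices in a generator, so that sequences common in real languages come up more often than rare ones.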
Nonetheless, I'm starting to believe I've reached the limits of this type of model. I'm considering whether it would be possible to upgrade the Markov chain to a "second-order" version. While the model currently considers only the previous phoneme when weighting the probabilities of the next phoneme, a "second-order" model would consider the previous two phonemes.
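To make the difference concrete, here is a hedged sketch of a generic order-n Markov chain over phonemes, written in Python for brevity. It is not the project's implementation: it learns from example words instead of a sonority hierarchy, and `build_model`, `generate`, the "#" boundary marker, and the `TOY` data are all assumptions made for illustration.

```python
import random
from collections import Counter, defaultdict

BOUNDARY = "#"  # hypothetical marker for the start and end of a word

def build_model(words, order):
    """Map each context of `order` preceding phonemes to next-phoneme counts."""
    model = defaultdict(Counter)
    for phonemes in words:
        padded = [BOUNDARY] * order + list(phonemes) + [BOUNDARY]
        for i in range(order, len(padded)):
            context = tuple(padded[i - order:i])
            model[context][padded[i]] += 1
    return model

def generate(model, order, rng, max_len=12):
    """Sample phonemes until the end-of-word marker is drawn."""
    context = (BOUNDARY,) * order
    out = []
    while len(out) < max_len:
        followers = model[context]
        nxt = rng.choices(list(followers), weights=list(followers.values()))[0]
        if nxt == BOUNDARY:
            break
        out.append(nxt)
        # Slide the context window forward by one phoneme.
        context = (context + (nxt,))[-order:]
    return "".join(out)

# Toy placeholder words, not real language data.
TOY = [["s", "k", "a"], ["k", "r", "o"]]

if __name__ == "__main__":
    rng = random.Random()
    for order in (1, 2):
        model = build_model(TOY, order)
        print(order, sorted({generate(model, order, rng) for _ in range(100)}))
```

With this toy data, the first-order model can stitch /sk/ and /kr/ together into "skro", because after /k/ it has forgotten what came before. The second-order model only ever reproduces "ska" and "kro", since it remembers whether /k/ followed /s/ or the start of the word. That extra memory is the appeal of a higher order, at the cost of needing more data or configuration to fill in every two-phoneme context.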
Try it for yourself
Unfortunately, I have no publicly available version of the application deployed at the moment, nor do I plan to deploy one in the near future. I formerly had the v1 app available on Heroku (I think the dead link is still in the README), but when they ended their free tier I did not redeploy it. The v2 app is not particularly stable at the moment, so I'm not planning on deploying that either.
You can still view the source code for the v1 app on GitHub. The front-end is available here, and the back-end here.