Making of #IAN - Technical Appendix
TL;DR: Preprocessing text, fine-tuning with Google TRC and deploying. Not a lot of jokes.
When I published the Making of #IAN post last weekend, I thought it would be fine to skip the technical details, because, really - who cares? Boy was I wrong1. A lot of people reached out to me and asked me for details of how I did it2. I have a lot of other cool things I want to show you on this Substack, but providing more details on the process of building a #IAN appears to be reasonably high-priority. So, here it is, the “technical appendix”.
Extracting and preprocessing text.
First, you want to collect the text to finetune your language model on. This part is fiddly, but I think there is no way around that - the relevant data is spread over different platforms3 in different formats that require different tools for extraction. Really, I think it’s almost surprising that this kind of information pooling is possible at all. Language is great.
In the next paragraphs I provide a high-level description of what I did for each platform, but if you prefer jumping straight into the code, here is a GitHub repository.
Roam Research. I exported the database as one large .json file, from which I then stripped all metadata. I recursively traversed the JSON using this function and searched for the key "string". As postprocessing, it makes sense to either delete all block references (that's what I did) or to replace them with the text they reference (this obviously introduces duplicates to your training set and increases overfitting). Some Roam js plug-ins like the RoamEnhancement pdf highlighter introduce a lot of redundant code into your graph, so consider removing this as well in preprocessing.
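In case you want to roll your own, here is a minimal sketch of what such a traversal can look like. It assumes the structure of Roam's JSON export (block text under "string", nested blocks under "children"); the file name is a placeholder.

```python
import json

def collect_strings(node, out):
    # Recursively walk the export; Roam stores block text under "string"
    # and nested blocks under "children".
    if isinstance(node, dict):
        if "string" in node:
            out.append(node["string"])
        collect_strings(node.get("children", []), out)
    elif isinstance(node, list):
        for item in node:
            collect_strings(item, out)

blocks = []
with open("roam_export.json") as f:  # placeholder file name
    collect_strings(json.load(f), blocks)
```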
Web crawling. I played around with crawling links that I mention in my notes with the trafilatura package. It's even relatively straightforward to automatically extract raw text from mentioned pdfs with the parser from the tika package. But this rapidly accumulated a lot of text - so much that it dominated the mix.4 So in the end I did not include this data at all. Perhaps some more selective crawling of abstracts or of repeatedly referenced links could produce good results, though.
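For reference, the two extraction paths boil down to something like this (URL and file path are hypothetical placeholders; tika additionally needs a Java runtime on your machine):

```python
import trafilatura
from tika import parser

# Raw text from a web page (placeholder URL)
downloaded = trafilatura.fetch_url("https://example.com/some-article")
page_text = trafilatura.extract(downloaded)

# Raw text from a pdf (placeholder path)
pdf_text = parser.from_file("papers/some_paper.pdf")["content"]
```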
WhatsApp. WhatsApp used to allow batch export of your entire chat history. This is no longer possible. There are tools that still promise to do this, but they did not appear worth the effort (they look a lot like adware to me). Instead I just manually exported some of my most wordy conversations and then massaged them a bit in python.
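The exported .txt files contain one message per line, prefixed with a timestamp and the sender's name. The massaging is then mostly a regex, sketched below - note that the timestamp format depends on your locale, so treat the pattern as an assumption to adapt:

```python
import re

# Matches lines like "31.12.21, 23:59 - Jan: message text";
# the exact timestamp format varies with your locale.
MESSAGE = re.compile(r"^\d{2}\.\d{2}\.\d{2}, \d{2}:\d{2} - (?P<sender>[^:]+): (?P<text>.*)$")

cleaned = []
with open("WhatsApp Chat.txt", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        m = MESSAGE.match(line.strip())
        if m and m.group("text") != "<Media omitted>":
            cleaned.append(f"{m.group('sender')}: {m.group('text')}")
```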
Facebook. Facebook does still allow you to batch-download all your data5. Conversations with users that are no longer on Facebook look a bit messed up formatting-wise, but otherwise this is very straightforward.
Email. Extracting email was so much harder than I'd expected. Apple Mail allows exporting your sent emails in the .mbox format, which is already human-readable text. But removing attachments and most of the metadata took forever; I ended up hacking together a wonky solution based on this stackoverflow answer. All in all, probably not worth the effort. It is a lot of data, but it's very repetitive and low in interesting information. As a result, #IAN loves to schedule meetings with everyone and everything, including inanimate objects and ghosts6... Perhaps applying some stronger filter can produce good results here.
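If you want to avoid the wonky route, Python's built-in mailbox module gets you most of the way. A minimal sketch that keeps only the plain-text parts (the file name is a placeholder, and real mailboxes will still need extra cleanup of quoted replies and signatures):

```python
import mailbox

bodies = []
for msg in mailbox.mbox("Sent Messages.mbox"):  # placeholder file name
    for part in msg.walk():
        # Keeping only text/plain parts drops attachments and HTML duplicates
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            if payload:
                bodies.append(payload.decode(errors="replace"))
```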
Telegram. Telegram was harder than WhatsApp. It provides an API that can be queried with the telethon package, but they don't make it easy to scrape all your messages (good on them, actually). Check the GitHub repo linked above to see what I hacked together. It works well enough, but takes a while to run.
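The core of it looks roughly like the sketch below. You have to create your own api_id and api_hash at my.telegram.org; the values and the message limit here are placeholders.

```python
from telethon.sync import TelegramClient

api_id, api_hash = 12345, "0123456789abcdef"  # placeholders - use your own

messages = []
with TelegramClient("scrape_session", api_id, api_hash) as client:
    for dialog in client.iter_dialogs():
        for msg in client.iter_messages(dialog, limit=1000):
            if msg.text:  # skip media-only messages
                messages.append(msg.text)
```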
Books. eBooks are a fantastic source of high-quality text, and if you are selective about the books you include, they are reasonably likely to reflect an aspect of your personality.7 I used pandoc to turn .epub files into raw text, which worked reliably and fast. It does introduce a bunch of newline characters where there should be none, so make sure to remove those.
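Wrapped in Python, the conversion plus newline cleanup could look like this (file names are placeholders, and the regex is just one heuristic for collapsing stray single newlines while keeping paragraph breaks):

```python
import re
import subprocess

# Convert the ebook to plain text with pandoc (placeholder file names)
subprocess.run(["pandoc", "book.epub", "-t", "plain", "-o", "book.txt"], check=True)

with open("book.txt", encoding="utf-8") as f:
    text = f.read()

# Collapse single newlines inside paragraphs, keep blank-line paragraph breaks
text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
```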
Note on splitting. Since the learning rate decreases over finetuning, there is a bias to learn whatever is shown in the beginning most strongly.8 Splitting all the text into chunks (and shuffling them) helps a lot, but also destroys most long-range structure in your text. But since current language models are not able to harness that structure efficiently anyway, I don’t think a lot is lost. If models ever get large enough to extract very long-range structure, some other trick than splitting will be necessary.9
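A minimal sketch of that splitting step, run before handing the files to the tfrecords script below (the chunk size and file names are arbitrary choices, not anything the training scripts require):

```python
import os
import random

with open("all_text.txt", encoding="utf-8") as f:  # placeholder file name
    text = f.read()

chunk_size = 2000  # characters; an arbitrary choice
chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
random.shuffle(chunks)  # so no single source dominates the start of training

os.makedirs("chunks", exist_ok=True)
for i, chunk in enumerate(chunks):
    with open(f"chunks/chunk_{i:05d}.txt", "w", encoding="utf-8") as out:
        out.write(chunk)
```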
Finetuning GPT-J.
After collecting the text, you’ll want the language model to perform a few more gradient descent updates. To do this, I followed the HowTo from the GPT-J repository here. Note that if you want to train with Google TRC, you should apply for access before doing the text preprocessing, because it might take a bit until your application is processed. Finetuning the model requires a .tfrecords file, which you can generate from a collection of text files in the folder INPUT_FOLDER by running

```
python3 mesh-transformer-jax/create_finetune_tfrecords.py --verbose --normalize-with-ftfy INPUT_FOLDER OUTPUT_NAME
```

after which you can run the device_train.py script. Make sure to read the “Learning Rate Notes” from the GPT-J HowTo; this has a major impact on output quality. I also found that running multiple epochs runs into overfitting issues very quickly, so one epoch is sufficient.
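For completeness, the training call itself then looks something like this - the config file and bucket path are placeholders, and you should check the HowTo for the flags the script currently expects:

```
python3 mesh-transformer-jax/device_train.py --config=configs/ian.json --tune-model-path=gs://YOUR_BUCKET/step_383500/
```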
Deploying.
Now you have a finetuned GPT-J model sitting in a Google bucket. How to talk to it?
Quick & dirty. Once fine-tuning is completed, you can use the device_serve.py script to get a quick & dirty flask server up and running that accepts POST requests. After setting up port forwarding with something like

```
./gcloud alpha compute tpus tpu-vm ssh YOUR_TPU_PROJECT --zone europe-west4-a -- -NL LOCAL_PORT:localhost:REMOTE_PORT
```

you can use curl on your local machine to query the model:

```
curl --header "Content-Type: application/json" --request POST --data '{"context":"I want to start a new journal club. It will be on the topic of", "top_p": 0.9, "temp": 0.75}' http://localhost:LOCAL_PORT/complete
```
Less quick & less dirty. Apart from the “WARNING: This is a development server. Do not use it in a production deployment.” message, there are other reasons why running the flask server is not a great long-term solution. The full fine-tuned model takes up ~150Gb of working memory and needs around 10 minutes to start up. Once your free month of TRC is over, you will also have to pay $5-8 per hour, which makes running the model constantly & on-demand infeasible.
Slimming the model with slim_model.py and then transferring the weights to Huggingface appears to be the way to go for something more robust & sustainable (keep an eye on the HowTo, which has been updated regularly over the last few weeks on this point). Unfortunately, this comes at the cost of making further finetuning a lot harder. I’m still undecided how to proceed on this point.
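Once the slimmed weights are converted and uploaded, querying them through the transformers library looks roughly like this. The model name is a placeholder for wherever you host your weights, and the sampling parameters just mirror the curl example above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model name - point this at your own uploaded weights
tokenizer = AutoTokenizer.from_pretrained("your-username/ian-gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained("your-username/ian-gpt-j-6B")

prompt = "I want to start a new journal club. It will be on the topic of"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9, temperature=0.75)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```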
Roam Research integration. Once you have an interface that can receive queries and return generated text, you can start thinking about deployment. In Roam Research, I decided to go with the Roam42 extension, which allows JavaScript execution and easy text extraction/insertion via SmartBlocks. The SmartBlock I wrote executes this snippet, which sends an asynchronous HTTP request to the language model and inserts the result into your Roam graph.
Android keyboard integration. Creating a custom keyboard for Android is a very nice entry-level project for learning app development. There are a ton of tutorials out there; I found this one particularly clear and well written. The key Java function that handles sending the POST request is here in my GitHub repo, alongside some of the other customized files. As I’ve mentioned, you will have to install the gcloud suite on Termux and give it all the permissions. Depending on how you deploy your model on the backend, this can be solved a lot more elegantly. Afaict, Volley is a good place to start.
And then?
And now the world is your oyster. If you build something cool, please consider letting me know about it (shoot me an email or write a comment)! If you’re having trouble with one of the steps, also feel free to contact me or write a comment; perhaps I can help. Also, stay tuned for more upcoming posts on #IAN and other things, and consider subscribing so you get a notification ☺️
1. I was also accused of employing the well-known academic hype-building trick of splitting research into multiple publications. I can only respond by proclaiming my full innocence and blaming the institutional structures that have shaped me. On a complete tangent, see this eLife paper on the connectome of the drosophila mushroom body with a measly 192 pages, 33 videos, and 41 figures plus supplementary figures. “More is more” is the new “Less is more”.
2. Every email I get is highly appreciated 💚 Like, honestly, Alexey Guzey is onto something.
3. I found that a 70-30 split of “text I produced” to “books that I love” works well.
4. I might be reasonably approximated by a mixture of my five closest friends, but I'm certainly not a potpourri of text on the internet that I have (at best) skimmed for a few seconds.
5. I recommend not reading messages from too long ago; very high risk of cringe.
6. #[[IAN says:]] Hey Spirit of the Christmas Past! Let's talk about that Special Presentation I was telling you about. I've set up a Zoom Meeting here: https://tum-conf.zoom.us/j/62084617861 Passcode 155693
See you soon!
Best,
Jan
7. I still went easy on books (only including my most beloved ones) because I didn't want them to drown out the text I produced myself. But perhaps a good strategy is to do two passes of finetuning (first with a lot of books you enjoy and a small learning rate, then with the rest of the text and a larger learning rate). This probably requires a bit of trial & error with the learning rates, though, and I’m not sure it’ll be worth it.
9. Perhaps some very consistent use of tags that provide context for different sources of text?