“Hmmm… I guess trying to build a movie script generator could be a fun side project.” Those were my thoughts about a week before I started working on a movie script generator based on GPT-2. I decided to start a side project related to NLP (natural language processing), since my professional experience at the time of this writing was mostly related to computer vision. I also wanted to try out Hugging Face (a very popular machine learning library), since I had read (and heard) good things about it but had never used it up until that point.
I won’t be going into the technicalities of how GPT-2 works and how text is generated. For that, I refer you to this wonderful illustrated GPT-2 guide. For reading this article, you need to know that in order to generate text you supply a prompt and GPT-2 generates follow-up text based on that prompt. The length of that “follow-up text” is determined by the sequence length: the maximum number of tokens (roughly, words or pieces of words) GPT-2 can generate in “one go”, before you have to feed it another prompt.
As I later found out, I was “battling a dragon” in the sense that I was trying to generate an entire movie script using GPT-2. At the time of this writing (6th of December 2022) this is a very hard task and, as far as I know, an active area of research. I quickly became aware that GPT-2 generates text of a certain sequence length, but I thought that by thoughtfully designing the prompt I feed to GPT-2 I could keep generating any text I wanted indefinitely. The prompt design, as we will see, did play a role in the quality of (some) GPT-2 outputs, but the task turned out to be far harder than I had imagined.
All of the code for this project can be found in this GitHub repository.
Let me lead you through the project as it sequentially unfolded in time. I started with getting the dataset which I would use to fine-tune GPT-2 to generate movie scripts.
Getting the dataset (data scraping)
Before any fine-tuning, I had to get the dataset I wanted to fine-tune the GPT-2 model on. I decided to scrape The Internet Movie Script Database (IMSDb). I did that using Scrapy. The basic gist of using Scrapy is that you write functions which parse the content of the web pages you are scraping. In my case, I had to follow (“click on”) multiple links in order to get to the movie script and then I had to extract the movie script from the web page. While constructing the scraper in question, I found the Scrapy tutorial and this XPath tutorial to be quite helpful. This is the spider that does the scraping. It places each movie script in the folder of the genre to which it belongs. If a movie script belongs to multiple genres, it is duplicated.
Once I was done with getting the dataset, I structured it in a way which suited me. For example, I stored each movie script both as a .txt and a .html file, and I decided to make two folders, txt and html, each of which has one subfolder per genre, named after that genre. These are technical details and if you want to know more, feel free to look at this project’s GitHub repository.
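The folder layout described above could be produced by a small helper along these lines. This is only a sketch and not the code from the repository: the function name save_script, its parameters, and the file-naming scheme are assumptions made for illustration.

```python
from pathlib import Path


def save_script(root, title, genres, text, html):
    """Write one script into <root>/txt/<genre>/ and <root>/html/<genre>/.

    A script that belongs to several genres is written once per genre,
    so it is duplicated on purpose.
    (Hypothetical helper; names and layout details are assumptions.)
    """
    written = []
    for genre in genres:
        for kind, content in (("txt", text), ("html", html)):
            folder = Path(root) / kind / genre
            folder.mkdir(parents=True, exist_ok=True)
            path = folder / f"{title}.{kind}"
            path.write_text(content, encoding="utf-8")
            written.append(path)
    return written
```

Calling it with two genres writes four files: one .txt and one .html copy in each genre folder.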
After getting the dataset, my next step was to enable the fine-tuning of the GPT-2 model.
Writing the fine-tuning code
I started out by googling something along the lines of “fine-tuning huggingface”. What I found was the following Hugging Face guide on fine-tuning a pretrained model. I tried both fine-tuning with the Hugging Face Trainer class and fine-tuning in native PyTorch. Both of these approaches produced multiple errors and after a while I quit trying to make them work.
What I found when I googled “fine-tuning huggingface gpt-2” was this web page which links to a Colab notebook which contains code for fine-tuning GPT-2. My fine-tuning (and text generation) notebook is heavily based on that notebook. I found the code in that notebook to work (almost) out-of-the-box and it was very readable, so I stuck to it.
Technical intermezzo: where did I fine-tune GPT-2 and how do I load the data?
I wasn’t able to fine-tune GPT-2 on my laptop, which has 8 GB RAM and 2 GB VRAM. In order to get around that, I bought some Google Colab Pro compute units and ran fine-tuning on Google Colab. I could have maybe used Kaggle for this (since Kaggle offers a significant amount of GPU time for free), but when I tried to upload my dataset to Kaggle it complained that filenames can’t contain certain characters, so I opted for Google Colab.
I used some of the Premium GPU(s) while fine-tuning, and I had 83.48 GB of RAM. Even with that much RAM, I had to be thoughtful about designing my data loader class: when I tried to load all of the movie lines from all of the movie script files into memory, I ran out of RAM. If you take a look at the data loader code which I wrote in the Jupyter notebook, you will see that in the constructor I make a dictionary which stores the information about which movie script line index corresponds to which movie script file. Once asked for the movie script line at a certain index, the data loader looks up that index in the dictionary, opens the corresponding movie script file, reads that particular line and returns just that line.
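To make the idea concrete, here is a minimal sketch of such a lazy data loader. It is not the class from the notebook: the name LazyScriptLines is hypothetical, and instead of a dictionary with one entry per line it stores only the first global line index of each file and finds the right file with a binary search (a swapped-in design choice that keeps the index small). It implements __len__ and __getitem__, which is the protocol a map-style PyTorch dataset is expected to provide.

```python
import bisect
from pathlib import Path


class LazyScriptLines:
    """Reads a single script line from disk on demand (illustrative sketch)."""

    def __init__(self, script_paths):
        self.paths = [Path(p) for p in script_paths]
        # starts[i] is the global index of the first line of file i.
        self.starts = []
        total = 0
        for path in self.paths:
            self.starts.append(total)
            with path.open(encoding="utf-8") as f:
                total += sum(1 for _ in f)  # count lines without keeping them
        self.total = total

    def __len__(self):
        return self.total

    def __getitem__(self, index):
        if not 0 <= index < self.total:
            raise IndexError(index)
        # Last file whose first line index is <= the requested index.
        file_no = bisect.bisect_right(self.starts, index) - 1
        local = index - self.starts[file_no]
        with self.paths[file_no].open(encoding="utf-8") as f:
            for line_no, line in enumerate(f):
                if line_no == local:
                    return line.rstrip("\n")
        raise IndexError(index)
```

The trade-off is the one described above: memory use stays small because only one line is held at a time, at the cost of opening and scanning a file on every access.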
Preprocessing the data and fine-tuning – version #1
My first idea was to structure the movie script lines as follows:
<genre name> | <original movie script line>
That is, I would add the genre name of the movie script in front of every line of that movie script and separate it with a vertical line (and a space before and after the vertical line). The reasoning behind this was that GPT-2 needs some metadata in order to know which movie script belongs to which genre. My idea was to prompt GPT-2 with the genre name so that it would generate a movie line from that genre. During the preprocessing, I decided to keep the blank lines and the HTML markup elements since those make up the movie script as well.
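As a rough illustration of this first format, the tagging step can be written in a couple of lines. The function name tag_lines is hypothetical and the exact spelling of the genre name is an assumption; blank lines are kept, as described above.

```python
def tag_lines(genre, script_text):
    """Prefix every line of a script, blank lines included, with '<genre> | '."""
    return [f"{genre} | {line}" for line in script_text.split("\n")]
```

With this format, the prompt at generation time would simply be the genre name followed by the separator.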
I did this and I fine-tuned GPT-2 on movie lines structured like that. The fine-tuning ran for 4 hours and as far as I recollect, it did not complete one full epoch. The results weren’t so good. Here’s one particular output (when I prompted it to generate horror):
Important note: The raw output contains HTML markup and the genre name, but for readability purposes, I have omitted it from the output displayed henceforth. You can view the raw output (with the HTML markup and the genre name) in Notes.txt in the GitHub repository.
(a few blank lines)
that's a trick of love.
As you can see, all of these movie lines make sense in and of themselves, but in the context of the movie script as a whole they are incoherent.
This is where I tried another approach which was suggested to me by Piotr Antoniak, whom I thank for this suggestion and for various other kinds of valuable feedback as this project developed.
Preprocessing the data and fine-tuning – version #2
At this point I knew that putting the genre name before every movie script line was not going to produce coherent movie scripts. The idea suggested to me by Piotr (whom, again, I thank very much) was to structure the movie line as follows:
<genre name> | <first part of the preprocessed movie line> <genre name> | <second part of the preprocessed movie line>
Note that above I wrote “… preprocessed movie line”. What does that mean? Well, Piotr suggested that each line I create should contain as many tokens as possible (so, if the GPT-2 model I use has a maximum sequence length of 768 tokens, I should have 768 tokens per line) and that I add the genre name at the beginning of that line and in the middle of that line. The idea was that the prompt for further text generation can then be the second half of the previously generated line, that is:
<genre name> | <second part of the generated text>
and I could repeat this process for as long as I wanted.
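The chaining idea can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: generate stands for any callable that takes a prompt and returns the prompt plus the model's continuation, and the names split_generated and generate_long are hypothetical.

```python
def split_generated(text, genre, sep=" | "):
    """Split '<genre> | first <genre> | second' into (first, second).

    second is None when no mid-line genre tag was generated.
    """
    marker = f"{genre}{sep}"
    body = text[len(marker):] if text.startswith(marker) else text
    first, found, second = body.partition(marker)
    if not found:
        return first.strip(), None
    return first.strip(), second.strip()


def generate_long(generate, genre, rounds, sep=" | "):
    """Chain generations: the second half of each output seeds the next prompt."""
    prompt = f"{genre}{sep}"
    parts = []
    tail = None
    for _ in range(rounds):
        first, tail = split_generated(generate(prompt), genre, sep)
        parts.append(first)
        if tail is None:
            break  # no mid-line tag, so there is nothing to continue from
        prompt = f"{genre}{sep}{tail}"
    if tail:
        parts.append(tail)  # keep the last, not yet continued, second half
    return parts
```

Each round keeps the first half of the output and feeds the second half back in, so the loop stops early whenever the model fails to produce the mid-line genre tag.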
This idea sounded good and, since I had already tried generating a movie script by just adding the genre name to the beginning of every movie script line, I tried this approach. There were other ideas on how to preprocess the movie script lines, but this one made the most sense to me. I wrote a script which preprocessed the movie scripts in the way Piotr described. I fine-tuned GPT-2 on that dataset. It was fine-tuned for 3 epochs and the fine-tuning ran for about 5 hours and 15 minutes.
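A simplified sketch of this second preprocessing step is shown below. It is not the script from the repository. As a loud assumption, it counts whitespace-separated words instead of real GPT-2 tokens and ignores the few tokens taken up by the genre tags, so a faithful version would measure length with the model's tokenizer. The function name build_training_lines is hypothetical.

```python
def build_training_lines(genre, script_text, words_per_line=768, sep=" | "):
    """Pack a script into long lines with the genre tag at the start and the middle.

    Assumption: words (split on whitespace) stand in for tokens.
    """
    words = script_text.split()
    half = words_per_line // 2
    lines = []
    for start in range(0, len(words), words_per_line):
        first = " ".join(words[start:start + half])
        second = " ".join(words[start + half:start + words_per_line])
        line = f"{genre}{sep}{first}"
        if second:  # the last chunk may be too short to have a second half
            line += f" {genre}{sep}{second}"
        lines.append(line)
    return lines
```

Note that splitting on whitespace drops blank lines, which is another way this sketch differs from a pipeline that keeps them.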
The fine-tuned model produced garbage outputs on 7/10 runs (meaning it generated only the genre tag and nothing else). When it didn’t produce garbage, it produced somewhat coherent outputs, such as these (I prompted it to generate comedy):
No, please -- no, no.
It's just you. And you like the sound of him singing -- the way he sings that way.
Harry takes a sip of water from the tap. Helen looks at him.
Did you know that?
HARRY (putting the glass down)
No. Just it's not the best singing he's ever heard in his life.
Oh. Oh, okay. You were just one of those people.
Helen turns the water off, drains a few drops from the glass, and spits it out.
Is that all there is to it, then?
No, of course not.
Helen pours him another glass. He reluctantly takes a sip, watches her. Then, without looking at Harry, he walks to the edge of the bed, sits down on the edge of it.
I'm sorry. (quietly) I'm sorry.
Helen nods, satisfied, and puts her hands over his ears.
It's good to see you, too.
She stands, kisses him lightly on the cheek, then holds out the bottle for him to sit down next to her.
I've got a feeling that if the rest of them knew how much I enjoyed the singing, they'd ask me to join in, like I was part of some circus troupe of some kind. They'd ask me to join as a member of the troupe -- but, like a little girl, they're really not part of the troupe.
HARRY (kisses him back)
HELEN (she gets up)
Would you rather come over? (he kisses her) You know, the whole world would be much easier for me to live without you.
(kisses her again) Uh-huh.
It'll be easier if I join.
Of course it is. (putting his (Note: this is the end of the line (which holds 768 words))
Let me try another one. He starts a play which begins with a long whirl.
EXT. FRONT LAWN - NIGHT
Harvey, still in a tuxedo, is waiting for the light to come on. He is just about to step out onto the front lawn when a loud KNOCK knocks him over the head with a heavy duty air gun, dropping him to the ground. Harvey, holding his head, is bleeding profusely, dazed and dirty, white and blue as if he'd dipped his neck in water.
HARVEY (in pain)
INT. KIMBERLY'S BEDROOM - NIGHT
Kimbers is lying in bed and reading a paper, when she hears the DOOR BEING LOCKED.
A KNOCK, then another, louder. She gets up, gathers her papers and heads out, only to find...
What's goin' on?
Listen, I think you lost your wife.
Harvey gets up and goes into the room.
You can notice that the characters change at each (generated) line. This wasn’t the case for all lines. Usually, one or two characters persisted through multiple lines (but not always). One of the examples with consistent characters through multiple lines is:
BOBBIE (still shyly)
Annie starts walking towards him. Bobbie is holding the basket of popcorn for her as he walks over to open it. He opens it and looks inside, taking in the scene in the rain.
Come on. I'm so glad we're here.
How'd it go?
ANNIE (holding up the box)
It's about eight.
BOBBIE (holding out the popcorn)
Wow. It's a big baby.
I know. We were worried about a baby when we were in the neighborhood.
... (a new line)
It'll take a year to find it. You're going on a murder charge. The first murder charge is a murder charge. We're going to find a new one.
ANNIE (taking popcorn)
The crowd starts to disperse.
Hey, Annie. The children are outside the window and watching.
Bobbie reaches out to open the box and looks at it.
Annie looks up.
I love that baby.
EXT. NEW YORK CITY STREETS - EARLY MORNING
Annie and Bobbie walking down the street in their old school yearbooks. A car arrives at the corner. Annie jumps and runs to the car. Bobbie climbs in and the car pulls away.
The prompt in the example above was comedy as well. It’s not the most coherent output, but it’s way better than just outputting the genre tag and a blank line.
The outputs I listed here were taken from the Notes.txt file in the GitHub repository; there you can find more outputs I found interesting.
I also tried another prompt structure, namely:
<first half of the generated movie line> | <genre name>
but it didn’t produce any better results. I also experimented with some parameters when generating the text (for details feel free to look at the Jupyter notebook in the GitHub repository), but I didn’t see any significant improvement in the generated text.
At this point, I decided to wrap this project up. I took a few weeks’ break before revisiting and reviewing the code I wrote, and after that was done I decided I would do this write-up and close this project. Before I write up my conclusion, I list some technical notes and some future work suggestions.
- data scraper:
  - there are 11 movie scripts that aren’t enclosed in <pre> tags (either one or two); I ignored those
  - some scripts don’t have a link which leads to the script reading page, so I ignored them
  - if a script belongs to multiple genres, I duplicated it so that it is contained in each genre folder it belongs to
  - within movie_scripts_spider.py, I save the movie scripts into their own files in the parse_movie_script_reading_page function; as far as I know, this isn’t idiomatic Scrapy and things like Item Pipelines should be used instead; I decided to keep my code the way it is because it works, but I’m noting this here
- prompt structure:
  - as I already mentioned, the <genre name> | <second part of the generated text> prompt generated garbage output about 70% of the time
  - in regard to the <first half of the generated movie line> | <genre name> prompt:
    - that prompt very often generated dialogue in French; I don’t understand why
    - sometimes that prompt generated a different genre name than the one in the prompt (e.g. for the COMEDY genre it generated FANTASY somewhere in the generated line)
    - sometimes that prompt generated 2 genre names such that the second genre name was near the end or the beginning of the sequence (and not halfway)
    - that prompt generated good but repetitive output about 40% of the time (some of the lines repeated themselves)
  - I also tried generating new lines with the prompt being <second half of the generated movie line>, but with no success
Here is a list of things which I think could be tried (in no particular order):
- strip the text of the HTML tags (and/or blank lines) and fine-tune the model
- fine-tune the model longer
- think of other ways to structure the prompt and test it
- additional parameter tuning during text generation
- if this model were to be deployed in production, it would need output parsing (since the output from the model is in one line and contains HTML markup)
- also, some thought would have to go into how to merge different lines which overlap (the overlap is usually caused by similar prompts)
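The last two suggestions, output parsing and merging overlapping lines, could start from something like the sketch below. Both helpers are hypothetical and deliberately crude: strip_markup uses a regular expression rather than a real HTML parser, and merge_overlap only handles exact character-level overlap between the end of one line and the start of the next.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")


def strip_markup(text):
    """Remove anything that looks like an HTML tag and tidy the spacing."""
    without_tags = TAG_RE.sub(" ", text)
    return re.sub(r"[ \t]+", " ", without_tags).strip()


def merge_overlap(left, right, min_overlap=1):
    """Append right to left, dropping the longest prefix of right
    that already appears as a suffix of left."""
    limit = min(len(left), len(right))
    for size in range(limit, min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return left + right[size:]
    return left + right
```

For example, merging "Helen nods and smiles" with "and smiles at Harry" yields "Helen nods and smiles at Harry". Generated lines that overlap only approximately would need fuzzier matching than this.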
This non-exhaustive list leads me to the conclusion.
I tried to build a movie script generator based on GPT-2. It didn’t work as expected, but it produces relatively coherent output some of the time. At the time when I started this project, little did I know that long text generation was an active research problem in NLP and that generating entire movie scripts is beyond the scope of the latest NLP models.
I am happy with this side project because I learned (or sharpened) skills related to:
- data scraping
- data preprocessing
- machine learning, in particular:
  - Hugging Face library
  - fine-tuning a pretrained model
  - using a fine-tuned model for inference
  - natural language processing
Although this project didn’t turn out quite as I planned, I have to say I was pleasantly surprised by the amount of fun I had while working on it. Working with text was fun for me and I very much liked experimenting with different prompts and seeing what output GPT-2 comes up with.
Feel free to experiment with the code I provided. Maybe you can find a way to generate longer coherent movie script sequences (or maybe you can find a way to generate entire coherent movie scripts; who knows!).
That’s it for this post. I hope you enjoyed it and that it helped you in some way.