Tiny but Mighty: Extracting a Complete Article from a Screen Recording Using Local Models

Imitating capabilities of SOTA models with stuff that can run on your laptop

Shaked Zychlinski 🎗️
6 min read · Jan 27, 2025
Made with DALL·E

Source code is available on my GitHub page: shakedzy/screen-reader

Not too long ago, a friend told me he saw a tweet on X where someone demonstrated how they sent Google’s Gemini a screen-capture video of themselves scrolling through a webpage, asking the model to extract the full article from the video (I wish I had saved the link; if you’ve seen it too, send it to me!). The catch is, of course, that each frame only showed a portion of the full article. While Gemini was able to complete the task successfully, it made me wonder: can this be done with small models, running on my laptop?

The Plan

The first thing I had to solve is the fact that (at least at the time of writing) there are no small models which accept videos as input. That means I’ll have to break the video into frames and work on each one of them separately. Then, since the text on each frame should be quite clear, I can use a small vision model like Llama-3.2-Vision-11B to get the text out of each frame, and another small LLM (like Llama-3.1-8B) to stitch it all together.

An abstract overview of the plan to extract text from a video

Splitting into Frames

The first thing I realized was that splitting the video into frames can’t be as naive as I thought. I’m using a Mac, and macOS’s built-in screen recorder saves the recording as a .mov file at 60 frames per second. That means a 25-second screen-capture video (like the one in my example) contains a little more than 1,500 frames (!!), which is way too much to process, especially when the difference between consecutive frames is almost none. So I’ll need a way to filter these and keep only the frames that have a meaningful difference between them.

To handle this effectively, there’s no real need for any AI model whatsoever. All I really need is to count the number of differing pixels between each pair of frames, and set an acceptable threshold that determines when there’s a significant change. I used OpenCV for this, and managed to reduce the 1,500+ frames to a little more than 30 💪.

Two frames with a meaningful difference between them, from Wikipedia’s Duck article, as found by the algorithm
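Here’s a minimal sketch of that filtering idea, assuming grayscale pixel differences and a hypothetical 2%-changed-pixels threshold (the actual thresholds in the repo may differ):

```python
import cv2
import numpy as np

def extract_key_frames(video_path: str, diff_ratio_threshold: float = 0.02) -> list[np.ndarray]:
    """Keep only frames that differ meaningfully from the last kept frame."""
    cap = cv2.VideoCapture(video_path)
    kept, last_gray = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if last_gray is None:
            kept.append(frame)
            last_gray = gray
            continue
        # Count pixels whose intensity changed noticeably between the two frames
        diff = cv2.absdiff(gray, last_gray)
        changed_ratio = np.count_nonzero(diff > 25) / diff.size
        if changed_ratio > diff_ratio_threshold:
            kept.append(frame)
            last_gray = gray
    cap.release()
    return kept
```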

Extracting Text

Once I had narrowed down the number of frames, I assumed the rest would be easy: throw each frame at a vision model to extract the text, and ask another model to stitch it all together. I used Ollama’s quantized builds of Llama-3.2-Vision (11B) and Llama-3.1 (8B) for this, running on a MacBook with 36GB of RAM.
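For reference, the per-frame extraction step looked roughly like this, using the Ollama Python client (the prompt wording here is illustrative, not the repo’s exact prompt):

```python
import ollama

def frame_to_text(image_path: str) -> str:
    """Ask a local vision model to transcribe the article text visible in one frame."""
    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{
            "role": "user",
            "content": "Extract the article text visible in this screenshot. "
                       "Return only the text, without URLs, menus or ads.",
            "images": [image_path],
        }],
    )
    return response["message"]["content"]
```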

But I quickly realized I was way too optimistic. The models failed in different ways: they ignored important text, included unimportant text (like the page URL or other links), or simply hallucinated and extracted text that did not exist. That made the entire stitching step completely impossible, as the model (or even I) couldn’t tell what went where, what was missing and what should be left out. No matter how much I tried to fine-tune the prompts, it just never yielded a good-enough result.

At that point I thought I’d just give up, but unfortunately I had already told my friend something like “sure, I can do this with local models, that’s no problem”, so I kind of had no choice but to get it to work. I mean, I could tell him I failed, but no, I’m not doing that.

Plan B

Realizing I couldn’t just ask the models to do the hard work without some guidance, I thought about how to give them a slight nudge in the right direction. So I decided to go with a different model: Microsoft’s Florence-2. Florence has two main advantages over most other VLMs (Vision-Language Models):

  1. It’s very (very) small (~0.5GB for the base version, and ~1.5GB for the large version)
  2. Its OCR capability also returns each text segment’s location on the screen

But Florence also has a major disadvantage when compared to other VLMs: Florence doesn’t try to understand the text, it returns it as-is. That means that two lines of the same passage are two different texts in the eyes of Florence. I’ll need to stitch everything together myself.

OCR polygons of Duck on Wikipedia, as extracted by Florence-2
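Running Florence-2’s region-aware OCR through Hugging Face transformers looks roughly like this, following the model card’s usage pattern (the base checkpoint and generation parameters are assumptions, not necessarily what the repo uses):

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

MODEL_ID = "microsoft/Florence-2-base"
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, trust_remote_code=True, torch_dtype=torch.float32)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

def ocr_with_regions(image: Image.Image) -> dict:
    """Return the recognized text segments together with their bounding polygons."""
    task = "<OCR_WITH_REGION>"
    inputs = processor(text=task, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # Parsed output looks like {'<OCR_WITH_REGION>': {'quad_boxes': [...], 'labels': [...]}}
    return processor.post_process_generation(
        generated_text, task=task, image_size=(image.width, image.height)
    )
```

These quad_boxes are what the heuristics below operate on: their x-values give the alignment, and their y-values give the line spacing.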

To do this, I used some common-sense techniques. While none are bulletproof, they seem to get the job done in a variety of cases.

  1. The article text is always aligned in the same manner. While I don’t know what that alignment is, I can assume the vast majority of the screen recording shows the article itself. If all article lines are formatted the same way, the most recurrent left alignment (the left x-value of the bounding boxes) across all frames is probably the alignment of the article text. Finding it is fairly easy, and from this point forward I can assume that only bounding boxes with these x-values are the article’s text.
  2. Now that I have all (and only) the article’s lines, I check the spacing between them (a sketch of both heuristics follows this list):
    -> The shortest spacing is most probably the line spacing within a paragraph. These lines belong to the same paragraph.
    -> The second-shortest spacing is probably paragraph spacing.
    -> Any other spacing is probably due to ads or images, and is therefore ignored.
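Here’s a minimal sketch of both heuristics, assuming each OCR line arrives as a (text, left_x, top_y) tuple; the tolerances are illustrative, and the second-shortest/other-spacing distinction is collapsed into “start a new paragraph” for brevity:

```python
from collections import Counter

def extract_paragraphs(ocr_lines: list[tuple[str, float, float]]) -> list[str]:
    """Group OCR lines into paragraphs using left alignment and vertical spacing."""
    if not ocr_lines:
        return []
    # 1. The most recurrent left x-value is assumed to be the article's alignment
    left_counts = Counter(round(x) for _, x, _ in ocr_lines)
    article_x = left_counts.most_common(1)[0][0]
    lines = sorted(
        [(text, y) for text, x, y in ocr_lines if abs(round(x) - article_x) <= 3],
        key=lambda item: item[1],
    )
    if len(lines) < 2:
        return [text for text, _ in lines]
    # 2. The smallest vertical gap is assumed to be line spacing within a paragraph;
    #    any larger gap (paragraph spacing, or space left by ads and images) starts a new one
    gaps = [round(b[1] - a[1]) for a, b in zip(lines, lines[1:])]
    line_gap = min(gaps)
    paragraphs, current = [], [lines[0][0]]
    for gap, (text, _) in zip(gaps, lines[1:]):
        if gap <= line_gap + 1:
            current.append(text)
        else:
            paragraphs.append(" ".join(current))
            current = [text]
    paragraphs.append(" ".join(current))
    return paragraphs
```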

With this, I can easily extract the article text and even its paragraph structure directly from the data I get from Florence-2. The stitching also solves itself: assuming there’s sufficient text repetition between consecutive frames (and there should be, since I define the frame-difference threshold), all I need to do is find where the text at the end of one frame appears in the next one, and merge the two. Fairly easy, and it works great!
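A minimal sketch of that merge step, assuming the end of one frame’s text reappears at the start of the next frame’s (the minimum-overlap length is an arbitrary choice here):

```python
def merge_texts(previous: str, current: str, min_overlap: int = 20) -> str:
    """Append `current` to `previous`, dropping the part of `current` that repeats."""
    max_window = min(len(previous), len(current))
    for size in range(max_window, min_overlap - 1, -1):
        if previous.endswith(current[:size]):
            return previous + current[size:]
    # No meaningful overlap found: just concatenate with a paragraph break
    return previous + "\n\n" + current
```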

What About the Title?

But is all of the article’s text really aligned in the same manner? No. There’s one thing that is almost guaranteed to look different: the title. For this, I once again seek assistance from Llama-3.2-Vision. I used another assumption: the title of the article will be in the first frame which has some of the article text in it. That is because I don’t assume the recording starts when you’re already viewing the article (you might be googling it first, like I did in my example). But even then, it’s a fair assumption that the title will be close to the article text and near its top. So I sent only the first frame which had article text in it to Llama, asking it to extract the title, a task it had no issue accomplishing.
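This is essentially the same Ollama vision call as before, just with a title-focused prompt (again, the exact wording is illustrative):

```python
import ollama

def extract_title(first_article_frame: str) -> str:
    """Ask the vision model for the title in the first frame showing article text."""
    response = ollama.chat(
        model="llama3.2-vision",
        messages=[{
            "role": "user",
            "content": "This screenshot shows the top of an article. "
                       "Return only the article's title, nothing else.",
            "images": [first_article_frame],
        }],
    )
    return response["message"]["content"].strip()
```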

And that’s it — it’s done.

Final Remarks

The world of Generative AI is divided, roughly, into two major families: the ones up on Olympus (GPT, Claude, Gemini and the like), and those slowly climbing up but still quite far behind, namely small open-source models. While far less intelligent, what small models can do is incredible, and this project is only one demonstration of what can be achieved on a single regular laptop with their power plus some common sense. The options are limitless, and I can only encourage you to try and take another step up the mountain too!
