Author: Gonçalo Valério

  • Django: Deferred constraint enforcement

    Another Friday, another Django related post. I guess this blog is becoming a bit monothematic. I promise the next ones will bring the much-needed diversity of contents, but today let’s explore a very useful feature of Django’s ORM.

    Ok… Ok… it’s more of a feature of PostgreSQL that Django supports, and it isn’t available on the other database backends. But let’s dive in anyway.

    Let’s imagine this incredibly simplistic scenario where you have the following model:

    class Player(models.Model):
      team = models.ForeignKey(Team, on_delete=models.CASCADE)
      squad_number = models.PositiveSmallIntegerField()
    
      class Meta:
        constraints = [
          models.UniqueConstraint(
            name="unique_squad_number",
            fields=["team", "squad_number"],
          )
        ]

    So a team has many players and each player has a different squad/shirt number. Only one player can use that number for a given team.

    Users can select their teams and then re-arrange their players’ numbers however they like. To keep it simple, let’s assume it is done through the Django Admin, using a Player Inline on the Team’s model admin.

    We add proper form validation to ensure that no two players in the submitted squad are assigned the same squad_number. Things work great until you start noticing that, despite your validation and despite the user’s input not assigning the same number to any two players, integrity errors are flying around. What’s happening?

    Well, when the system tries to update some player records after they have been correctly validated, each update/insert is checked against the constraint immediately, even when done atomically within a transaction. This means that, depending on the order of the updates, and in certain situations even when all updates contain correct data, integrity errors will be raised due to conflicts with the data currently stored in the database.

    The solution? Deferring the integrity checks to the end of the transaction. Here’s how:

    class Player(models.Model):
      team = models.ForeignKey(Team, on_delete=models.CASCADE)
      squad_number = models.PositiveSmallIntegerField()
    
      class Meta:
        constraints = [
          models.UniqueConstraint(
            name="unique_squad_number",
            fields=["team", "squad_number"],
            deferrable=models.Deferrable.DEFERRED,
          )
        ]

    Now, when you save multiple objects within a single transaction, you will no longer see those errors if the input data is valid.
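
    For illustration, here is the kind of operation that previously could blow up and now works: swapping the numbers of two players of the same team inside a transaction. This is a minimal sketch, not code from the original post, and player_a/player_b are placeholders:

    from django.db import transaction

    # With an immediate constraint, the first save() could already violate
    # "unique_squad_number". With DEFERRED, the check only runs at commit time.
    with transaction.atomic():
        player_a.squad_number, player_b.squad_number = (
            player_b.squad_number,
            player_a.squad_number,
        )
        player_a.save(update_fields=["squad_number"])
        player_b.save(update_fields=["squad_number"])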

  • Django: Overriding translations from dependencies

    This week, I’ll continue on the same theme of my previous “Django Friday Tips” post. Essentially, we will keep addressing small annoyances that can surface while developing your multilingual project.

    The challenge for this article shows up when a given string from a package that is a dependency of your project is either:

    • Not translated in the language you are targeting.
    • Translated in a slightly different way than you desire.

    As we are all aware, most packages use English by default, then the most popular ones often provide translations for languages that have more active users willing to contribute. But these efforts are laborious and can have differences for distinct regions, even if they use the same base language.

    Contributing upstream might not always be an option.

    This means that to maintain the coherence of the interface of your project, you need to adapt these translations locally.

    Handling the localization of the code that lives in your repository is straightforward and well documented: Django collects the strings and adds the translation files to the locale path (per app or per project).

    For the other packages, these strings and translations are located within their own directory hierarchy, outside the reach of the makemessages command. At runtime, Django goes through all these locale paths in order and uses the first match it finds.

    With this in mind, the easiest and most straightforward way I was able to find to achieve this goal was:

    Create a file in your project (in an app directory or in a common project directory), let’s call it locale_overrides.py, and put there the exact strings from your dependency (Django or another) that you wish to translate:

    from django.utils.translation import gettext_lazy as _
    
    locale_overrides = [
        _("This could sound better."),
        ...
    ]

    Then run manage.py makemessages, translate the new lines in the .po file as you wish, then finally compile your new translations with manage.py compilemessages.

    Since your new translations are found first, when your app is looking for them, they will be picked instead of the “original” ones.
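
    One related detail worth checking (an assumption about a typical project layout, not something strictly required by the steps above): if you keep these overrides at the project level, the directory holding the compiled translations should be listed in LOCALE_PATHS, since those paths are searched before the locale directories of installed apps.

    # settings.py — a minimal sketch, assuming a project-level "locale" directory
    from pathlib import Path

    BASE_DIR = Path(__file__).resolve().parent.parent

    # Directories listed here take precedence over app and Django translations
    LOCALE_PATHS = [
        BASE_DIR / "locale",
    ]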

    For tiny adjustments, this method works great. When the amount of content starts growing too much, a new approach might be more appropriate, but that will be a topic for another time.

  • Security.txt in the wild: 2025 edition

    One year ago, I checked the top 1 million “websites” for a security.txt file and then posted the results in this blog. As it was described at the time, I used a tool written by someone else who had already run this “experiment” in 2022.

    You can look at that post if you are keen to know what this file is, why I was curious about the adoption numbers, and what last year’s results were.

    As promised, I am collecting and publishing this information on the blog again this year. Yes, I did remember or, more precisely, my calendar app did.

    The first step was to download the same software again, and the second step was to download the most recent list of the top 1 million domains from the same source.

    Then, after consuming energy for a few hours and wasting some bandwidth, the results were the following:

                                  Total      Change from 2024
    Sites scanned                 999968     -0.003%
    Sites with a valid file       1773       -81%
    Sites with an invalid file    12140      +454%
    Sites without a file          986055     -0.25%

                         Contact    Policy    Hiring    Encryption    Expiry
    Sites with value     12019      4526      3107      3052          8480
    Change from 2024     +30.3%     +23.2%    +21.2%    +15.2%        +70.9%

    Overall, there was an expected increase in usage; however, the change from last year is again underwhelming. The number of domains with the file (valid plus invalid) went from 11501 to 13913, which is a minimal improvement.

    The valid/invalid numbers seem to be messed up, but this could be due to the software being outdated with the spec. I didn’t waste too much time on this.

    Even setting aside the limitations described in the original author’s post, and ignoring the valid file detection issue, I think these results might not reflect reality, due to the huge number of errors found in the output file.

    Overall, adoption seems to be progressing, but it still seems very far from being something mainstream.

    If I do this next year, perhaps it will be better to use a different methodology and tools, so I can obtain more reliable results.

  • Status of old PyPI projects: archived

    Since late January, the Python Package Index (PyPI) supports archiving projects/packages. This is a very welcome feature, since it states, without any doubt, when a package is no longer maintained and will not receive any further updates.

    It makes it easier for the person looking for packages to know which ones deserve a closer inspection and which ones are just sitting there abandoned, polluting the results.

    Previously, the only viable way to retire a package was by adding a disclaimer to the README and letting it sit there indefinitely, treated just like the other active packages.

    “You had the option of deleting the package”, you might say. Yes, but as I explained in a previous post, this is dangerous and should be avoided. So, archiving is in my view the best course of action when a person no longer wants to maintain their published packages and projects.

    With this in mind, this week I decided to do my part and archive old packages that I had published for different reasons and were there abandoned for years. These were:

    • mdvis: a small package I wrote many years ago, mostly to learn how to publish things on PyPI.
    • auto-tune: something I was about to start working on for a previous employer and that was cancelled at the last minute.
    • django-cryptolock: an experiment done for a previous client. It tried to implement an existing proposal for an authentication scheme, using Monero wallets.
    • monero-python: a few years ago, during my day-to-day work, this package was removed (then renamed by the original author). At the time, it was a direct dependency for many projects and tools, which meant a malicious actor could have taken the name and compromised those systems. As a precaution, I grabbed the open name. It has been there empty ever since.

    Now it is your turn.

    After a sufficient number of packages get marked as archived, we can hope for some enhancements to the search functionality of PyPI. Namely, a way of filtering out archived packages from the results and a visual marker for them in the list view. One step at a time.

  • Why isn’t my translation showing up?

    Here we go again for another post of this blog’s irregular column, entitled Django’s Friday Tips. Today let’s address a silent issue that anyone who has worked with internationalization (i18n) has almost certainly already faced.

    You add a string that must be translated:

    from django.utils.translation import gettext_lazy as _
    
    some_variable = _("A key that needs translation")

    You then execute the manage.py makemessages --locale pt command, go to the generated django.po file and edit the translation:

    msgid "A key that needs translation"
    msgstr "Uma chave que precisa de tradução"

    You compile (manage.py compilemessages --locale pt), and proceed with your work.

    A few moments later, when checking the results of your hard effort… nothing, the text is showing the key (in English).

    Time to double-check the code (did I forget the gettext stuff?), the translation file, the i18n settings, etc. What the hell?

    Don’t waste any more time, most likely the translation is marked as fuzzy, like this:

    #: myapp/module.py:3
    #, fuzzy
    #| msgid "Another key that needed translation"
    msgid "A key that needs translation"
    msgstr "Uma chave que precisa de tradução"

    You see, you didn’t notice that #, fuzzy line and kept it there. The command that compiles those translations ignores the messages marked as fuzzy.

    So the solution is to remove those two extra lines, or to compile with the --use-fuzzy flag. That’s it: compile again, and you should be able to proceed with the problem solved.
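
    If you suspect other entries were silently skipped for the same reason, a short script can list them all. This is just a sketch assuming the polib package is installed; a simple grep for “#, fuzzy” works just as well:

    import polib

    # List every message in the catalog that is marked as fuzzy
    po = polib.pofile("locale/pt/LC_MESSAGES/django.po")
    for entry in po:
        if "fuzzy" in entry.flags:
            print(entry.msgid)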

  • The books I enjoyed the most in 2024

    Another year went by, and another batch of books was consumed. Just like I did last year, I want to share the ones that I enjoyed the most.

    But what kind of metric is that? Truth be told, it is not an objective one. Last year, I clearly described it like this:

    I don’t mean they are masterpieces or references in a given field, what I mean is that I truly enjoyed the experience. It could be because of the subject, the kind of book, the writing style or for any other reason.

    What matters is that I was able to appreciate the time I spent reading them.

    And I still think it is what truly matters.

    So this year I will repeat the dose — two more books that were entirely worth the money.

    “Broken Money”, by Lyn Alden

    This is a book about money (surprise, surprise). Not in the usual sense of telling the reader how to earn it or on how to spend it. The focus is instead on what it is, what forms of money existed throughout history, how it was used and how each of those forms failed to fulfil their purpose at a given time.

    As the book progresses, it introduces the reader to important financial concepts, practices, and institutions that were born to fulfil certain needs, or to accomplish a desired outcome. It discusses their purposes and their problems.

    When describing the current state of affairs, the author focuses on how the existing financial system doesn’t serve all people equally. Example after example, we can see how some benefit from it, while others are harmed by it, over and over again.

    The book ends by taking a look at the internet age and exploring “alternatives” that are surfacing on the horizon.

    It had a real impact on how I see money and the financial system.

    “Masters of Doom”, by David Kushner

    Another great book that was a joy to read was “Masters of Doom”, and I guess that every kid from the 90s that touched a PC during that time will know at least one game that is mentioned there.

    It tells the story about the people behind “id Software” and their journey throughout most of the decade while they developed and released games such as Commander Keen, Wolfenstein 3D, Doom, and Quake.

    As a kid, I remember playing and enjoying some of those games, many hours of fun and excitement. I was too young to know or follow the stories and the dramas of the game development industry, but I definitely hold great memories of the outcome.

    In the book you will find how they met, the ups, the downs, the drama, etc. You know, the whole rollercoaster that any new and successful company eventually goes through.

    While many other people were involved in making those games, and eventually in making the company prosper, the two main characters in this story are John Carmack and John Romero. The two have very distinct personalities, and it is remarkable how far they were able to take this endeavor together.

    If you lived during that time, I guess you will enjoy the book.

  • Optimizing Mastodon for a single user

    I’ve been participating in the Fediverse through my own mastodon instance since 2017.

    What started as an experiment to test new things, focused on exploring decentralized and federated alternatives for communicating on top of the internet, stuck. At the end of 2024, I’m still there.

    The rhetoric on this network is that you should find an instance with a community that you like and start from there. At the time, I thought that having full control over what I publish was a more interesting approach.

    Nowadays, there are multiple alternative software implementations that you can use to join the network (this blog is a recent example) that can be more appropriate for distinct use cases. At the time, the obvious choice was Mastodon in single user mode, but ohh boy it is heavy.

    Just to give you a glimpse, the container image surpasses 1 GB in size, and you must run at least 3 of those containers, plus a PostgreSQL database, a Redis broker, and optionally Elasticsearch.

    For a multiple user instance, this might make total sense, but for a low traffic, single user service, it is too much overhead and can get expensive.

    A more lightweight implementation would fit my needs much better, but just thinking about the migration process gives me cold feet. I’m also very used to the apps I ended up using to interact with my instance, which might be specific to Mastodon’s API.

    So, I decided to go in a different direction and look for the available configuration options that would allow me to reduce the weight of the software on my small machine. In other words, run Mastodon on the smallest machine possible.

    My config

    To achieve this goal and after taking a closer look at the available options, these are the settings that I ended up changing over time that produced some improvements:

    • ES_ENABLED=false — I don’t need advanced search capabilities, so I don’t need to run this extra piece of software. This was a decision I made on day 1.
    • STREAMING_CLUSTER_NUM=1 — This is an old setting that manages the number of processes that deal with web sockets and the events that are sent to the frontend. For 1 user, we don’t need more than one. In recent versions, this setting was removed, and the value is always one.
    • SIDEKIQ_CONCURRENCY=1 — Processing background tasks in a timely fashion is fundamental for how Mastodon works, but for an instance with a single user, 1 or 2 workers should be more than enough. The default value is 5, I’ve used 2 for years, but 1 should be enough.
    • WEB_CONCURRENCY=1 — Dealing with a low volume of requests doesn’t require many workers, but having at least some concurrency is important. We can achieve that with threads, so we can keep the number of processes at 1.
    • MAX_THREADS=4 — The default is 5, I reduced it to 4, and perhaps I can go even further, but I don’t think I would have any significant gains.

    To save some disk space, I also periodically remove old content from users that live on other servers. I do that in two ways:

    • Changed the media cache retention period to 2 and user archive retention period to 7, in Administration>Server Settings>Content Retention.
    • Periodically run the tootctl media remove and tootctl preview_cards remove commands.

    Result

    In the end, I was able to reduce the resources used by my instance and avoid many of the alerts my monitoring tools were throwing all the time. However, I wasn’t able to downsize my machine and reduce my costs.

    It still requires at least 2 GB of RAM to run well, even though with these changes, there’s much more breathing room.

    If there is a lesson to be learned or a recommendation to be done with this post, it is that if you want to participate in the Fediverse, while having complete control, you should opt for a lighter implementation.

    Do you know any other quick tips that I could try to optimize my instance further? Let me know.

  • An experiment in fighting spam on public forms using “proof of work”

    Spam is everywhere: if you have an email account, a mailbox, a website with comments, a cellphone, a social media account, a public form, etc., you will find it there. We all know it, it is a plague.

    Over the years, there have been multiple attempts to fight spam, with various degrees of success, some more effective than others, some with more side effects, some simple, some complex, some proprietary…

    Online, one of the most successful approaches has been captchas. Just like spam, these little and annoying challenges are everywhere. I’m not a fan of captchas, for many reasons, but mainly because the experience for humans is often painful, they waste our time, and they make us work for free.

    So, for public forms on my websites, I usually don’t use any sort of measure to fight spam. The experience for humans is straightforward, and later I deal with the spam myself.

    It isn’t too much work, but these are also just a few low-traffic websites. Obviously, this doesn’t “scale”.

    So, in March, I decided to do a little experiment, to try to find a way of reducing the amount of spam I receive from these forms. I would rather not attach an external service, use a captcha or anything that could change the experience of the user.

    Proof-of-work enters the room

    It is public knowledge that the “proof of work” mechanism used in Bitcoin started as one of these attempts to fight email spam. So this is not a novel idea; it is decades old.

    The question is, would some version of it work for my public forms? The answer is yes, and in fact, it didn’t take much to get some meaningful results.

    So, without spending too much effort, I added a homegrown script to one of my forms that would do the following before the form is submitted to the server:

    fields = (grab all form fields)
    content = (concatenate contents of all fields)
    difficulty = (get the desired number of leading zeros)
    loop
      nonce = (get new nonce value)
      hash_input = (concatenate nonce with content)
      hash = (SHA256 digest of hash_input)
      if hash meets difficulty
        Add nonce to form
        break
      end if
    end loop
    submit form

    On the server side, I just calculate the hash of the contents plus the nonce, and check if it matches the desired difficulty. If it doesn’t, I discard the message.
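
    Here is a minimal sketch of that server-side check, assuming the submitted fields arrive as a dict concatenated in the same order the client used, and that the difficulty is measured in leading zeros of the hex digest:

    import hashlib

    DIFFICULTY = 4  # illustrative value, not the one used in the experiment

    def is_valid_submission(fields: dict[str, str], nonce: str) -> bool:
        # Recompute the hash the browser claims to have found.
        content = "".join(fields.values())
        digest = hashlib.sha256((nonce + content).encode()).hexdigest()
        # Accept only if it meets the required number of leading zeros.
        return digest.startswith("0" * DIFFICULTY)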

    The mechanism described above, obviously, has serious flaws and could be bypassed in multiple ways (please don’t rely on it). It is not even difficult to figure out. But for the sake of the experiment, this was the starting point.

    I just tuned the difficulty parameter a bit, to something a human couldn’t distinguish from a slow website but that would be costly for bots (nothing scientific).

    The results

    Six months went by, and the results are better than I initially expected. Especially because I thought I would have to gradually improve the implementation as the spammers figured out how to bypass this little trick.

    I also assumed that at least some of them would use headless browsers, to deal with all the JavaScript that websites include nowadays, and automatically bypass this protection.

    In the end, I was dead wrong. So this little form went from ~15 spam submissions every single day, to 0. During the whole 6-month period, a total of 1 spam message went through.

    But you might ask, has the traffic stopped? Did the bots stop submitting?

    [Chart: number of form submissions during the last 30 days]

    As the chart above shows, no, the spammers continued to submit their home cooked spam. During the last 30 days, a total of 210 POST requests were sent, but the spam was just filtered out.

    So, what did I learn from this experiment? I guess the main lesson was that these particular spammers are really low-effort creatures. You raise the bar a little, and they stop being effective.

    The second lesson was that we can definitely fight spam without relying on third parties and without wasting our users/visitors time. A better implementation is undoubtedly required, but for low traffic websites, it might be a good enough solution. We just need an off-the-shelf component that is easy to integrate (I guess some already exist, I just didn’t spend too much time exploring).

    I’m also curious about other alternatives, such as requiring a micropayment to submit the form (like a fraction of a cent). Until now, this would require a third-party integration and be a pain in every way, but with Bitcoin’s Lightning Network becoming ubiquitous, this might become a viable alternative (there are similar projects out there that work great).

  • Hawkpost enters “maintenance only” mode

    In practice, this already happened a couple of years ago; now we are just making it official.

    For those who don’t know, Hawkpost is a side project that I started while at Whitesmith back in 2016 (8+ years ago). I’ve written about it here in the blog on several occasions.

    To sum it up, it is a tool made to solve a problem that at the time I frequently saw in the wild while doing the typical agency/studio work: clients and most people shared credentials and other secrets for their projects through insecure means (in plain text on chats, emails, etc.). It bothered me to the point of trying to figure out a solution that was both easy to use for me and my coworkers, and obvious/transparent to people who simply don’t care about it.

    Awareness about encryption at the time, while making rapid progress, was not as widespread as it is today. Tools were not as easy to use.

    Hawkpost ended up being very useful for many people. It didn’t have to be perfect, it just needed to improve the existing state of affairs, as it did.

    Eight years later, things have changed. I no longer do agency work, I’ve changed workplaces, awareness improved a lot, and many other tools appeared on the market. Hawkpost’s development has stalled, and while it still has its users, we haven’t seen much overall interest in continuing to work on it.

    To be sincere, I don’t use it anymore. That’s because I have other tools at my disposal that are much better for the specific use-cases they address, and perhaps also better for Hawkpost’s original purpose.

    Here are some examples:

    • For sharing credentials within a non-technical team (if you really must): Use a proper team password manager such as Bitwarden or 1Password.
    • For sharing files and other sizable data: one good alternative is to use send (the successor of Firefox Send). It also has an official CLI client.
    • For sharing and working on encrypted documents: CryptPad has a whole range of applications where data is encrypted E2E.

    So, this week, we released version 1.4.0 of Hawkpost. It fixes some bugs, updates major dependencies, and makes sure the project is in good shape to continue receiving small updates. The full list of changes can be found here.

    However, new features or other big improvements won’t be merged from now on (at least for the foreseeable future). The project is in “maintenance only” mode. Security issues and anything that could make the project unusable will be handled, but nothing else.

  • Is it “/.well-known/”?

    Ironically, in my experience, the .well-known directory doesn’t do justice to its name, even in use cases that would fit nicely within its original purpose.

    But I’m getting a bit ahead of myself. Let’s first start with what it is, then move to discuss where it’s used. But we’ll do this rapidly, otherwise this post will get boring really fast.

    Let’s look at what the RFC has to say:

    Some applications on the Web require the discovery of information about an origin before making a request.

    … designate a “well-known location” for data or services related to the origin overall, so that it can be easily located.

    … this memo reserves a path prefix in HTTP, HTTPS, WebSocket (WS), and Secure WebSocket (WSS) URIs for these “well-known locations”, “/.well-known/”. Future specifications that need to define a resource for such metadata can register their use to avoid collisions and minimise impingement upon origins’ URI space.

    So, briefly, it is a standard place, or set of standard URIs, that can be used by people or automated processes to obtain (meta)data about resources of the domain in question. The purpose of the requests and the content of the responses don’t even need to be related to the web.

    The RFC introduces the need for this “place”, by providing the example of the “Robots Exclusion Protocol” (robots.txt), which is a good example… that paradoxically doesn’t use the well-known path.

    Now that the idea is more or less settled, here are other examples of cool and useful protocols that actually make use of it.


    ACME HTTP Challenge

    The use-case here is that an external entity needs to verify that you own the domain. To prove it, you place a unique/secret “token” at a certain path, so that this entity can make a request and check that it is true.

    Many Let’s Encrypt tools make use of this approach.

    Security.txt

    This one is a bit obvious, and I already addressed it in previous posts (here and here). It is just a standard place to put your security contacts, so that researchers can easily find all the data they need to alert you about any of their findings.
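
    For reference, a minimal example of what such a file, served at /.well-known/security.txt, can look like (the values below are placeholders):

    Contact: mailto:security@example.com
    Expires: 2026-06-30T00:00:00.000Z
    Encryption: https://example.com/pgp-key.txt
    Policy: https://example.com/security-policy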

    Web Key Directory (PGP)

    Traditionally, OpenPGP relied on “key servers” and the web of trust for people to fetch the correct public keys for a given email address. With the “Web Key Directory”, domain owners can expose the correct and up-to-date public keys for the associated addresses in a well-known path. Then, email clients can quickly fetch them just by knowing the address itself.

    Lightning Address / LN URL Pay

    Sending on-chain Bitcoin to pay for a beer at the bar, or to send a small tip, is not that useful or practical at all (time and long addresses will get in the way).

    For small payments in Bitcoin, the lightning network is what you should use. While instantaneous, this approach requires a small dance between both wallets (showing a QR code, etc.).

    Using a lightning address (which looks essentially the same as an email address) solves this problem. You type the address and send the funds, done. Your wallet takes care of figuring out the rest. To accomplish that, it fetches all the information from a standard place in the /.well-known/ path.

    I wrote about it before, and if you wish, you can buy me a beer by sending a few “sats” to my “email” address.

    • Suffix: lnurlp
    • Details 1 / 2
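
    As a rough sketch of what the wallet does under the hood, assuming the common LNURL-pay flow and the requests library (error handling omitted), resolving an address boils down to a single HTTPS request:

    import requests

    def resolve_lightning_address(address: str) -> dict:
        """Fetch the pay-request metadata advertised for a lightning address."""
        name, domain = address.split("@")
        url = f"https://{domain}/.well-known/lnurlp/{name}"
        response = requests.get(url, timeout=10)
        response.raise_for_status()
        # The JSON includes the callback URL and the min/max payable amounts.
        return response.json()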

    Password-Change

    This feature allows password managers to know where the form to change the password of a given website is located. It lets users go straight to that page from the password manager’s UI.

    Digital Asset Link

    Have you ever touched on a link while using your Android smartphone and received a suggestion to open it in a certain app instead of the predefined web browser?

    Me too, now you know how it is done.


    The whole list of recognized well-known URIs can be found here. But I guess there are way more suffixes in use, since 2 of the 6 mentioned above are not on it but are widely used within their ecosystems.

    That’s it. Looking at the list above gives us a glimpse of how certain things are implemented, and a few good ideas for things we could add to our domains/websites.

  • “Extracting wisdom” from conference videos

    PyCon US happened in May; this month, the 154 videos gradually started being published on YouTube. Between then and now, many other interesting conferences took place. That’s a lot of talks, presentations, and content to be digested.

    The truth is, I, like most people, won’t watch it all, since our time is limited. One option is to look at the titles and descriptions, then guess what might be the most interesting content. This is a gamble, and my experience tells me that I often get disappointed with my picks.

    This process can be repeated for the dozens of conferences a person is interested in. What if we could have a way of:

    • Finding the best videos to watch, based on our “needs”.
    • Extracting the main teachings of all content.
    • Storing it in a consumable/searchable way.

    It sounds like a lot of work. But fortunately, in 2024 our digital assistants can help us with that.

    A couple of months ago, I wrote a blog post about how I run these AI tools on my device without leaking any data to external services. Today, I’ll describe how I used them to extract the key information and learnings from all the videos I won’t be able to watch.

    In the process, I will also share the results publicly. So, let’s get started.

    The tools

    As I’ve mentioned in the other post, I use “ollama” to run the AI models locally. For this task, since it will require the digestion of a lot of content and given my machine has modest resources, I will rely on llama3:8b. A bigger model might render better results, but it would take forever.

    The next steps are to provide the model with the content that needs to be analyzed, and then precisely instruct it on what to do. For this part of the task, I will rely on “fabric”.

    “Fabric” provides a tool to download the transcripts from YouTube videos and contains a collection of carefully curated prompts (aka instructions) for the AI models to do a multitude of tasks. For my goal, I will use the extract_wisdom pattern/prompt.
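
    This is not the exact setup described above, but to give an idea of the plumbing involved, here is a rough sketch of the same idea using the youtube_transcript_api package and Ollama’s local HTTP API directly (the model name, the prompt file, and the endpoint are assumptions about a typical local install):

    import requests
    from youtube_transcript_api import YouTubeTranscriptApi

    OLLAMA_URL = "http://localhost:11434/api/generate"
    PROMPT = open("extract_wisdom_prompt.txt").read()  # hypothetical local copy of the prompt

    def extract_wisdom(video_id: str, model: str = "llama3:8b") -> str:
        # Fetch the transcript and flatten it into plain text.
        transcript = YouTubeTranscriptApi.get_transcript(video_id)
        text = " ".join(chunk["text"] for chunk in transcript)

        # Ask the locally running model to apply the prompt to the transcript.
        response = requests.post(
            OLLAMA_URL,
            json={"model": model, "prompt": f"{PROMPT}\n\n{text}", "stream": False},
            timeout=600,
        )
        response.raise_for_status()
        return response.json()["response"]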

    Example output

    To demonstrate what can be achieved by letting the model analyze a conference video, following the instructions of a good prompt, I’m including an example.

    I used “extract_wisdom” on the following video: Iterate, Iterate, Iterate! (Moshe Zadka) — PyTexas 2024

    I obtained the following output:

    **SUMMARY**
    Mosha ZKA presents on iterators and generators in Python. He discusses how to use them, their benefits, and common pitfalls.
    
    **IDEAS:**
    • Iterators are objects that define a sequence of values.
    • Generators are special types of functions that can be used as iterators.
    • The `iter()` function returns an iterator object.
    • Use the `next()` method to get the next value from an iterator.
    • Avoid using infinite loops with generators.
    • Use iterators and generators to process large datasets efficiently.
    • Python's built-in functions, such as `range()`, are implemented using iterators.
    • You can use the `enumerate()` function to iterate over both indices and values.
    • The `zip()` function takes iterables of equal length and returns an iterator.
    • Use the `groupby()` function from the `itertools` module to group items based on a common attribute.
    • Avoid using `for` loops with generators, as they can lead to infinite loops.
    
    **INSIGHTS:**
    • Iterators and generators are essential tools for working with large datasets in Python.
    • Using iterators and generators can simplify your code and improve performance.
    • Be mindful of the differences between iterators and generators, and use them accordingly.
    • Avoid using `for` loops with generators, as they can lead to infinite loops.
    • Use the `iter()` function to create an iterator object from a sequence or other iterable.
    • Use the `next()` method to get the next value from an iterator.
    • Iterators and generators can help you write more efficient and scalable code.
    
    **QUOTES:**
    • "Iterators are objects that define a sequence of values."
    • "Generators are special types of functions that can be used as iterators."
    • "The `iter()` function returns an iterator object."
    
    **HABITS:**
    • Set aside time each day to learn new programming concepts.
    • Read books on Python programming and related topics.
    • Practice writing code using iterators and generators.
    • Join online communities and forums for Python developers.
    
    **FACTS:**
    • The first version of Python was released in 1991.
    • Guido van Rossum, the creator of Python, wanted to create a scripting language that was easy to learn and use.
    • Python is now one of the most popular programming languages worldwide.
    • The `itertools` module in Python provides many useful functions for working with iterators.
    
    **REFERENCES:**
    • "Python Crash Course" by Eric Matthes
    • "Automate the Boring Stuff with Python" by Al Sweigart and others
    
    **ONE-SENTENCE TAKEAWAY:**
    Iterators and generators are powerful tools that can simplify your code, improve performance, and help you write more scalable programs.
    
    **RECOMMENDATIONS:**
    • Use iterators and generators to process large datasets efficiently.
    • Avoid using `for` loops with generators, as they can lead to infinite loops.
    • Practice writing code using iterators and generators.
    • Read books on Python programming and related topics.
    • Join online communities and forums for Python developers.

    Evaluating the results

    After going through the notes and then picking some videos, it is clear that the extracted content is not even close to being 100% spot on. I’ve noticed things such as:

    • Focusing on only a small part of the video.
    • Highlighting superfluous stuff, while missing content that I would classify as important.
    • Misinterpretation of what has been said (I only found one occurrence of this, but I assume there will be more. It is in the example I’ve shown above, try to find it).

    Nevertheless, I still found the results helpful for the purpose I was aiming for. I guess that some issues might be related to:

    1. The model that I’ve chosen. Perhaps using a bigger one would render better notes.
    2. The fact that the approach relies on transcripts. The model misses the information that is communicated visually (slides, demos, etc.).

    This approach would definitely provide better results when applied to written content and podcast transcripts.

    The repository

    I’ve written a quick script to run this information extraction on all videos of a YouTube playlist (nothing fancy that’s worth sharing). I’ve also created a repository where I store the results obtained when I run it for a conference playlist (PyCon is still not there, since not all videos have been released yet).

    Every time I do this for a conference I’m interested in, I will share these automatically generated notes there.

    You are welcome to ask for a specific conference to be added. If it is on YouTube, it is likely I can generate the notes. Just create a new issue in the repository.

  • Ways to have an atomic counter in Django

    This week, I’m back at my tremendously irregular Django tips series, where I share small pieces of code and approaches to common themes that developers face when working on their web applications.

    The topic of today’s post is how to implement a counter that isn’t vulnerable to race conditions. Counting is everywhere, when handling money, when controlling access limits, when scoring, etc.

    One common rookie mistake is to do it like this:

    model = MyModel.objects.get(id=id)
    model.count += 1
    model.save(update_fields=["count"])

    This approach is subject to race conditions, as described below:

    • Process 1 gets the count value (let’s say it is currently 5)
    • Process 2 gets the count value (also 5)
    • Process 1 increments and saves (6)
    • Process 2 increments and saves (6 again)

    Instead of 7, you end up with 6 in your records.

    On a low stakes project or in a situation where precision is not that important, this might do the trick and not become a problem. However, if you need accuracy, you will need to do it differently.

    Approach 1

    from django.db import transaction

    with transaction.atomic():
        model = (
            MyModel.objects.select_for_update()
            .get(id=id)
        )
        model.count += 1
        model.save(update_fields=["count"])

    In this approach, when you first fetch the record, you ask the database to lock it. While you are handling it, no other transaction can modify it (or lock it) until yours finishes.

    Since it locks the records, it can create a bottleneck. You will have to evaluate if it fits your application’s access patterns. As a rule of thumb, it should be used when you require access to the final value.

    Approach 2

    from django.db.models import F
    
    # update() returns the number of affected rows, not a model instance
    updated = MyModel.objects.filter(id=id).update(
        count=F("count") + 1
    )

    In this approach, you don’t lock any rows or need to explicitly work inside a transaction. Here, you just tell the database that it should add 1 to the value that is currently there. The database will take care of atomically incrementing the value.

    It should be faster, since multiple processes can access and modify the record “at the same time”. Ideally, you would use it when you don’t need to access the final value.
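
    If you end up needing the updated value anyway, a follow-up read works (a small sketch; note it is a separate query, so the counter may have changed again in the meantime):

    # After the F() update, fetch the current value with a separate query.
    current = MyModel.objects.values_list("count", flat=True).get(id=id)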

    Approach 3

    from django.core.cache import cache

    # Note: incr() raises ValueError if the key does not exist yet,
    # so make sure it is initialized first (e.g. with cache.add).
    cache.incr(f"mymodel_{id}_count", 1)

    If your counter has a limited life-time, and you would rather not pay the cost of a database insertion, using your cache backend could provide you with an even faster method.

    The downside is the lower level of persistence and the fact that your cache backend needs to support atomic increments. As far as I can tell, you are well served with Redis and Memcached.

    For today, this is it. Please let me know if I forgot anything.

  • Are Redis ACL password protections weak?

    Earlier this year, I decided to explore Redis functionality a bit more deeply than my typical use-cases would require. Mostly due to curiosity, but also to have better knowledge of this tool in my “tool belt”.

    Curiously, a few months later, the whole ecosystem started boiling. Now we have Redis, Valkey, Redict, Garnet, and perhaps a few more. The space is hot right now and forks/alternatives are popping up like mushrooms.

    One thing they all inherited from Redis is storing ACL user passwords as SHA256 hashes. When I learned about this, I found it odd, since it goes against common best practices: the algorithm is very fast to brute force, does not protect against the usage of rainbow tables, etc.

    Instead of judging too fast, a better approach is to understand the reasoning for this decision, the limitations imposed by the use-cases and the threats such application might face.

    But first, let’s take a look at a more standard approach.

    Best practices for storing user passwords

    According to OWASP’s documentation on the subject, the following measures are important for applications storing user’s passwords:

    1. Use a strong and slow key derivation function (KDF).
    2. Add a salt (if the KDF doesn’t include it already).
    3. Add a pepper.

    The idea behind 1. is that computing a single hash should have a non-trivial cost (in time and memory), to decrease the speed at which an attacker can attempt to crack the stolen records.

    Adding a “salt” protects against the usage of “rainbow tables”, in other words, doesn’t let the attacker simply compare the values with precomputed hashes of common passwords.

    The “pepper” (a common random string used in all records), adds an extra layer of protection, given that, unlike the “salt”, it is not stored with the data, so the attacker will be missing that piece of information.
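
    To make the contrast with a plain SHA256 hash concrete, here is a minimal Python sketch of the salted, memory-hard approach described above, using scrypt from the standard library (the parameters are illustrative, not a vetted recommendation):

    import hashlib
    import secrets

    def hash_password(password: str) -> tuple[bytes, bytes]:
        salt = secrets.token_bytes(16)  # random per-password salt
        digest = hashlib.scrypt(
            password.encode(), salt=salt, n=2**14, r=8, p=1, dklen=32
        )
        return salt, digest

    def verify_password(password: str, salt: bytes, expected: bytes) -> bool:
        candidate = hashlib.scrypt(
            password.encode(), salt=salt, n=2**14, r=8, p=1, dklen=32
        )
        return secrets.compare_digest(candidate, expected)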

    Why does Redis use SHA256

    To store user passwords, Redis relies on a vanilla SHA256 hash. No multiple iterations for stretching, no salt, no pepper, nor any other measures.

    Since SHA256 is meant to be very fast and lightweight, it will be easier for an attacker to crack the hash.

    So why this decision? Understanding the use-cases of Redis gives us the picture that establishing and authenticating connections needs to be very, very fast. The documentation is clear about it:

    Using SHA256 provides the ability to avoid storing the password in clear text while still allowing for a very fast AUTH command, which is a very important feature of Redis and is coherent with what clients expect from Redis.

    Redis Documentation

    So this is a constraint that rules out the usage of standard KDF algorithms.

    For this reason, slowing down the password authentication, in order to use an algorithm that uses time and space to make password cracking hard, is a very poor choice. What we suggest instead is to generate strong passwords, so that nobody will be able to crack it using a dictionary or a brute force attack even if they have the hash.

    Redis Documentation

    So far, understandable. However, my agreement ends in the last sentence of the above quote.

    How can it be improved?

    The documentation leaves to the user (aka server administrator) the responsibility of setting strong passwords. In their words, if you set passwords that are lengthy and not guessable, you are safe.

    In my opinion, this approach doesn’t fit well with the “Secure by default” principle, which, I think, is essential nowadays.

    It leaves to the user the responsibility of not only setting a strong password, but also ensuring that the password is almost uncrackable (a 32-byte random string, according to their docs). Experience tells me that most users and admins won’t be aware of it or won’t do it.
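
    For reference, generating a password of that strength takes a single line with Python’s standard library (just to illustrate what a 32-byte random string means in practice):

    import secrets

    password = secrets.token_hex(32)  # 32 random bytes, rendered as 64 hex characters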

    Another point made to support the “vanilla SHA256” approach is:

    Often when you are able to access the hashed password itself, by having full access to the Redis commands of a given server, or corrupting the system itself, you already have access to what the password is protecting: the Redis instance stability and the data it contains.

    Redis Documentation

    Which is not entirely true, since ACL rules and users can be defined in configuration files and managed externally. These files contain the SHA256 hashes. This means that in many setups and scenarios, the hashes won’t live only on the Redis server; this kind of configuration will be managed and stored elsewhere.

    I’m not the only one who thinks the current approach is not enough; the teams behind compatible alternative implementations seem to share these concerns.

    So, after so many words and taking much of your precious time, you might ask, “what do you propose?”.

    Given the requirements for extremely fast connections and authentication, the first and main improvement would be to start using a “salt”. It is simple and won’t have any performance impact.

    The “salt” would make the hashes of not so strong passwords harder to crack, given that each password would have an extra random string that would have to be considered individually. Furthermore, this change could be made backwards compatible and added to existing external configuration files.

    Then, I would consider picking a key stretching approach or a more appropriate KDF to generate the hashes. This one would need to be carefully benchmarked, to minimize the performance impact. A small percentage of the time the whole process of initiating an authenticated connection takes could be a good compromise.

    I would skip the usage of a “pepper” for now, since it is not clear how it could be set up and managed from the user’s side. Pushing this responsibility to the user (the Redis server operator) would create more complexity than benefit.

    An alternative approach, that could also be easy to implement and would be more secure than the current one, would be to automatically generate the “password” for the users by default. It would work like regular API keys, since it seems this is how Redis sees them:

    However ACL passwords are not really passwords. They are shared secrets between the server and the client, because the password is not an authentication token used by a human being.

    Redis Documentation

    The code already exists:

    …there is a special ACL command ACL GENPASS that generates passwords using the system cryptographic pseudorandom generator: …

    The command outputs a 32-byte (256-bit) pseudorandom string converted to a 64-byte alphanumerical string.

    Redis Documentation

    So it could be just a matter of requiring the user to explicitly bypass this automatic “API key” generation in order to set their own custom password.

    Summing it up

    To simply answer the question asked in the title: yes, I do think the user passwords could be better protected.

    Given the requirements and use-cases, it is understandable that there is a need to be fast. However, Redis should do more to protect the users’ passwords or at least ensure that users know what they are doing and pick an almost “uncrackable” password.

    So I ended up proposing:

    • An easy improvement: Add a salt.
    • A better improvement: Switch to a more appropriate KDF, with low work factor for performance reasons.
    • A different approach: Automatically generate by default a strong password for the ACL users.
  • Local AI to the rescue

    The last couple of years have been dominated by the advancements in the Artificial Intelligence (AI) field. Many of us witnessed and are currently experiencing some sort of renaissance of AI.

    It started with generated images from prompts, then it was all types of written content, and in the last few weeks we’ve seen astonishing videos completely generated from a prompt.

    Simultaneously, many other more specific tasks and fields started seeing the outcomes of specialized usage of these technologies.

    Like any other tool ever produced by human ingenuity, it can be used for “good” and for “bad”. However, that’s not what I want to discuss in this post, just a general observation.

    Like many others, I felt the curiosity to experiment with these new tools, to see where they can help me in my daily life, either at work or at a more personal level.

    One thing that quickly caught my attention was that many of the most well-known products are only accessible through the internet. You send your inputs to “Company X” servers, they run the trained models on their end, and eventually the result is transmitted back to you.

    While understandable, given that the hardware requirements for AI stuff are massive, I find it unsettling that this continues the trend of all your data and interactions being shared with a remote company.

    Let’s take programming as a simple example, an area where some companies are betting strongly on AI helpers, such as GitHub’s Copilot. I think many employers wouldn’t be too happy knowing that their proprietary code was being leaked to a third party through developer interactions with an assistant.

    Even though the above example might not apply to everyone, it is a real concern, and in many places adopting such a tool would require a few discussions with the security and legal teams.

    That is why I turned my attention to how a person can run this stuff locally. The main obstacles to this approach are:

    • The models that are freely available might not be the best ones.
    • Your hardware might not be powerful enough

    Regarding the first problem, a few companies already released models that you can freely use, so we are good. They might not be as good as the big ones, but they don’t need to tell you all the right answers, nor do the job for you, to be useful in some way. They just need to help you break barriers with less effort, as it is shown in a recent study:

    Instead, it lies in helping the user to make the best progress toward their goals. A suggestion that serves as a useful template to tinker with may be as good or better than a perfectly correct (but obvious) line of code that only saves the user a few keystrokes.

    This suggests that a narrow focus on the correctness of suggestions would not tell the whole story for these kinds of tooling.

    Measuring GitHub Copilot’s Impact on Productivity

    The hardware issue is a bigger limitation to running more general and bigger models locally, however my experience showed me that smaller or more specific models can also bring value to the table.

    As proof that this is viable, we have the example of two web browsers that started integrating AI functionality, each for different reasons.

    With the case for Local AI on the table, the next question is: how?

    My local setup

    Next I’ll list and describe the tools I ended up with after some research and testing. It is very likely that this setup will change soon, since things are moving really fast nowadays. Nevertheless, presently, they have been working fine for me on all sorts of tasks.

    I mostly rely on four pieces of software:

    • Ollama: To run the Large Language Models (LLM) on my computers and provide a standard API that other apps can use.
    • Continue.dev plugin for my text editor/IDE: it presents a nice interface to the LLMs and easily attaches context to the session.
    • ImaginAIry: For generating images and illustrations. It can also generate video, but I never explored that part.
    • Fabric: A tool that provides “prompts” for common tasks you would like the AI to do for you.

    All of them work well, even on my laptop that doesn’t have a dedicated GPU. It is much slower than on my desktop, which is much more powerful, but usable.

    To improve that situation, I installed smaller models on the laptop, for example, codellama:7b instead of codellama:34b, and so on.

    And this is it for now, if you have other suggestions and recommendations for local AI tools that I should try, please let me know. I’m well aware that better things are showing up almost every day.

  • Security.txt in the wild

    A few years ago, I covered the “security.txt” spec here in the blog: a standard place for security-related contacts, designed to help researchers and other people find the right contacts to report vulnerabilities and other problems.

    At the time, I added it to my personal domain, as an example.

    When I wrote the post, the spec was still fairly recent, so, as expected, it wasn’t widely adopted and only the more security-conscious organizations had put it in place.

    Since then, as part of my security work, I implemented it for several products, and the results were good. We received and triaged many reports that were sent to the correct addresses since day one.

    Many people who put the security.txt file in place complain about the amount of low-effort reports that are sent their way. I admit this situation is not ideal. However, I still think it is a net positive, and the problem can be minimized by having a good policy in place and a streamlined triage process.

    While I always push for the implementation of this method on the products I work on, I have very little information about how widespread the adoption of this “spec” is.

    The topic is very common in certain “hacker” forums, but when I talk to people, the impression I get is that this is an obscure thing.

    The website findsecuritycontacts.com relies on security.txt to get its information. It also monitors the top 500 domains every day to generate some stats. The results are disappointing: only ~20% of those websites implement it correctly.

    I remember reading reports that covered many more websites, but recently, I haven’t seen any. With a quick search, I was able to find this one.

    It was written in 2022, so the results are clearly dated. On the bright side, the author published the tool he used to gather the data, which means we can quickly gather more recent data.

    So, to kill my curiosity, I downloaded the tool, grabbed the up-to-date list of the top 1 million websites from tranco-list.eu, gathered the same data, and with a few lines of Python code I obtained the following results:

    • Total sites scanned: 999992
    • Sites with a valid file: 9312 (~0.93%)
    • Sites with an invalid file: 2189 (~0.22%)
    • Sites without a file: 988491 (~98.85%)
                         Contact    Policy    Hiring    Encryption    Expiry
    Sites with value     9218       3674      2564      2650          4960

    The results are a bit underwhelming; I’m not sure if it is a flaw in the software, or if this is a clear picture of the reality.

    On the other hand, if we compare with the results that the original author obtained, this is about a 3-fold improvement over a period of a year and a half, which is a good sign.

    Next year, if I don’t forget, I will run the experiment again, to check the progress once more.
