Categories
Python Software Development

Django Friday Tips: Password validation

This time I’m gonna address Django’s builtin authentication system, more specifically the ways we can build custom improvements over the already very solid foundations it provides.

The idea for this post came from reading an article summing up some considerations we should have when dealing with passwords. Most of those considerations are about what controls to implement (what “types” of passwords to accept) and how to securely store those passwords. By default Django does the following:

  • Passwords are stored using PBKDF2. There are also other alternatives such as Argon2 and bcrypt, that can be defined in the setting PASSWORD_HASHERS.
  • Every Django release the “strength”/cost of this algorithm is increased. For example, version 3.1 applied 216000 iterations and the last version (3.2 at the time of writing) applies 260000. The migration from one to another is done automatically once the user logs in.
  • There are a set of validators that control the kinds of passwords allowed to be used in the system, such as enforcing a minimum length. These validators are defined on the setting AUTH_PASSWORD_VALIDATORS.

By default when we start a new project these are the included validators :

  • UserAttributeSimilarityValidator
  • MinimumLengthValidator
  • CommonPasswordValidator
  • NumericPasswordValidator

The names are very descriptive and I would say a good starting point. But as the article mentions the next step is to make sure users aren’t reusing previously breached passwords or using passwords that are known to be easily guessed (even when complying with the other rules). CommonPasswordValidator already does part of this job but with a very limited list (20000 entries).

Improving password validation

So for the rest of this post I will show you some ideas on how we can make this even better. More precisely, prevent users from using a known weak password.

1. Use your own list

The easiest approach, but also the more limited one, is providing your own list to `CommonPasswordValidator`, containing more entries than the ones provided by default. The list must be provided as a file with one entry in lower case per line. It can be set like this:

{
  "NAME": "django.contrib.auth.password_validation.CommonPasswordValidator",
  "OPTIONS": {"password_list_path": "<path_to_your_file>"}
}

2. Use zxcvbn-python

Another approach is to use an existing and well-known library that evaluates the password, compares it with a list of known passwords (30000) but also takes into account slight variations and common patterns.

To use zxcvbn-python we need to implement our own validator, something that isn’t hard and can be done this way:

# <your_app>/validators.py

from django.core.exceptions import ValidationError
from zxcvbn import zxcvbn


class ZxcvbnValidator:
    def __init__(self, min_score=3):
        self.min_score = min_score

    def validate(self, password, user=None):
        user_info = []
        if user:
            user_info = [
                user.email, 
                user.first_name, 
                user.last_name, 
                user.username
            ]
        result = zxcvbn(password, user_inputs=user_info)

        if result.get("score") < self.min_score:
            raise ValidationError(
                "This passoword is too weak",
                code="not_strong_enough",
                params={"min_score": self.min_score},
            )

    def get_help_text(self):
        return "The password must be long and not obvious"

Then we just need to add to the settings just like the other validators. It’s an improvement but we still can do better.

3. Use “have i been pwned?”

As suggested by the article, a good approach is to make use of the biggest source of leaked passwords we have available, haveibeenpwned.com.

The full list is available for download, but I find it hard to justify a 12GiB dependency on most projects. The alternative is to use their API (documentation available here), but again we must build our own validator.

# <your_app>/validators.py

from hashlib import sha1
from io import StringIO

from django.core.exceptions import ValidationError

import requests
from requests.exceptions import RequestException

class LeakedPasswordValidator:
    def validate(self, password, user=None):
        hasher = sha1(password.encode("utf-8"))
        hash = hasher.hexdigest().upper()
        url = "https://api.pwnedpasswords.com/range/"

        try:
            resp = requests.get(f"{url}{hash[:5]}")
            resp.raise_for_status()
        except RequestException:
            raise ValidationError(
                "Unable to evaluate password.",
                code="network_failure",
            )

        lines = StringIO(resp.text).readlines()
        for line in lines:
            suffix = line.split(":")[0]

            if hash == f"{hash[:5]}{suffix}":
                raise ValidationError(
                    "This password has been leaked before",
                    code="leaked_password",
                )

    def get_help_text(self):
        return "Use a different password"

Then add it to the settings.

Edit: As suggested by one reader, instead of this custom implementation we could use pwned-passwords-django (which does practically the same thing).

And for today this is it. If you have any suggestions for other improvements related to this matter, please share them in the comments, I would like to hear about them.

Categories
Personal Random Bits

My picks on open-source licenses

Sooner or later everybody that works with computers will have to deal with software licenses. Newcomers usually assume that software is either open-source (aka free stuff) or proprietary, but this is a very simplistic view of the world and wrong most of the time.

This topic can quickly become complex and small details really matter. You might find yourself using a piece of software in a way that the license does not allow.

There are many types of open-source licenses with different sets of conditions, while you can use some for basically whatever you want, others might impose some limits and/or duties. If you aren’t familiar with the most common options take look at choosealicense.com.

This is also a topic that was recently the source of a certain level of drama in the industry, when companies that usually released their software and source code with a very permissive license opted to change it, in order to protect their work from certain behaviors they viewed as abusive.

In this post I share my current approach regarding the licenses of the computer programs I end up releasing as FOSS (Free and Open Source Software).

Let’s start with libraries, that is, packages of code containing instructions to solve specific problems, aimed to be used by other software developers in their own apps and programs. On this case, my choice is MIT, a very permissive license which allows it to be used for any purpose without creating any other implications for the end result (app/product/service). In my view this is exactly the aim an open source library should have.

The next category is “apps and tools”, these are regular computer programs aimed to be installed by the end user in his computer. For this scenario, my choice is GPLv3. So I’m providing a tool with the source code for free, that the user can use and modify as he sees fit. The only thing I ask for is: if you modify it in any way, to make it better or address a different scenario, please share your changes using the same license.

Finally, the last category is “network applications”, which are computer programs that can be used through the network without having to install them on the local machine. Here I think AGPLv3 is a good compromise, it basically says if the end user modifies the software and let his users access it over the network (so he doesn’t distribute copies of it), he is free to do so, as long as he shares is changes using the same license.

And this is it. I think this is a good enough approach for now (even though I’m certain it isn’t a perfect fit for every scenario). What do you think?

Categories
Personal

And… the blog is back

You might have noticed that the website has been unavailable during the last week (or a bit longer than that), well, the reason is quite simple:

OVH Strasbourg datacenter burning
OVH Strasbourg datacenter burning (10-03-2021)

It took sometime but the blog was finally put online again, new content should be flowing in soon.

And kids, don’t forget about the backups, because the good old Murphy’s law never disappoints:

Anything that can go wrong will go wrong

Wikipedia
Categories
Python

Django Friday Tips: Subresource Integrity

As you might have guessed from the title, today’s tip is about how to add “Subresource integrity” (SRI) checks to your website’s static assets.

First lets see what SRI is. According to the Mozilla’s Developers Network:

Subresource Integrity (SRI) is a security feature that enables browsers to verify that resources they fetch (for example, from a CDN) are delivered without unexpected manipulation. It works by allowing you to provide a cryptographic hash that a fetched resource must match.

Source: MDN

So basically, if you don’t serve all your static assets and rely on any sort of external provider, you can force the browser to check that the delivered contents are exactly the ones you expect.

To trigger that behavior you just need to add the hash of the content to the integrity attribute of the <script> and/or <link> elements in question.

Something like this:

<script src="https://cdn.jsdelivr.net/npm/vue@2.6.12/dist/vue.min.js" integrity="sha256-KSlsysqp7TXtFo/FHjb1T9b425x3hrvzjMWaJyKbpcI=" crossorigin="anonymous"></script>

Using SRI in a Django project

This is all very nice but adding this info manually isn’t that fun or even practical, when your resources might change frequently or are built dynamically on each deployment.

To help with this task I recently found a little tool called django-sri that automates these steps for you (and is compatible with whitenoise if you happen to use it).

After the install, you just need to replace the {% static ... %} tags in your templates with the new one provided by this package ({% sri_static .. %}) and the integrity attribute will be automatically added.

Categories
Software Development

Documentation done right

One critical piece of the software development process that often gets neglected by companies and also by many open-source projects is explaining how it works and how it can be used to solve the problem in question.

Documentation is often lacking and people have an hard time figuring out how they can use or contribute to a particular piece of software. I think most developers and users have faced this situation at least once.

Looking at it from the other side, it isn’t always easy to pick and share the right information so others can hit the ground running. The fact that not everybody is starting from the same point and have the same goal, makes the job a bit harder.

One approach to solve this problem that I like is Divio’s documentation system.

Divio's documentation system explained. Showing the 4 quadrants and their relations.
The components of Divio’s documentation system.

It splits the problems in 4 areas targeting different stages and needs of the person reading the documentation. Django uses this system and is frequently praised for having great documentation.

From a user point of view it looks solid. You should take a look and apply it on your packages/projects, I surely will.

Categories
Personal

10 years

The first post I published on this blog is now 10 years old. This wasn’t my first website or even the first blog, but it’s the one that stuck for the longest time.

The initial goal was to have a place to share anything I might find interesting on the Web, a place that would allow me to publish my opinions on all kinds of issues (if I felt like it) and to be able to publish information about my projects. I think you still can deduce that from the tag line, that remained unchanged ever since.

From the start, being able to host my own content was one of the priorities, in order to be able to control its distribution and ensuring that it is universally accessible to anyone without any locks on how and by whom it should be consumed.

The reasoning behind this decision was related to a trend that started a couple of years earlier, the departure from the open web and the big migration to the walled gardens.

Many people thought it was an inoffensive move, something that would improve the user experience and make the life easier for everyone. But as anything in life, with time we started to see the costs.

Today the world is different, using closed platforms that barely interact with each other is the rule and the downsides became evident: Users started to be spied for profit, platforms decide what speech is acceptable, manipulation is more present than ever, big monopolies are now gate keepers to many markets, etc. Summing up, the information and power is concentrated in fewer hands.

Last week this event set the topic for the post. A “simple chat app”, that uses an open protocol to interact with different servers, was excluded/blocked from the market unilaterally without any chance to defend itself. A more extensive discussion can be found here.

The message I wanted to leave in this commemorative post, is that we need to give another shot to decentralized and interoperable software, use open protocols and technologies to put creators and users back in control.

If there is anything that I would like to keep for the next 10 years, is the capability to reach, interact and collaborate with the world without having a huge corporation acting as middleman dictating its rules.

I will continue to put an effort in making sure open standards are used on this website (such RSS, Webmention, etc) and that I’m reachable using decentralized protocols and tools (such as email, Matrix or the “Fediverse“). It think this is the minimum a person could ask for the next decade.

Categories
Python

Django Friday Tips: Permissions in the Admin

In this year’s first issue of my irregular Django quick tips series, lets look at the builtin tools available for managing access control.

The framework offers a comprehensive authentication and authorization system that is able to handle the common requirements of most websites without even needing any external library.

Most of the time, simple websites only make use of the “authentication” features, such as registration, login and logout. On more complex systems only authenticating the users is not enough, since different users or even groups of users will have access to distinct sets of features and data records.

This is when the “authorization” / access control features become handy. As you will see they are very simple to use as soon as you understand the implementation and concepts behind them. Today I’m gonna focus on how to use these permissions on the Admin, perhaps in a future post I can address the usage of permissions on other situations. In any case Django has excellent documentation, so a quick visit to this page will tell you what you need to know.

Under the hood

Simplified Entity-Relationship diagram of Django's authentication and authorization features.
ER diagram of Django’s “auth” package

The above picture is a quick illustration of how this feature is laid out in the database. So a User can belong to multiple groups and have multiple permissions, each Group can also have multiple permissions. So a user has a given permission if it is directly associated with him or or if it is associated with a group the user belongs to.

When a new model is added 4 permissions are created for that particular model, later if we need more we can manually add them. Those permissions are <app>.add_<model>, <app>.view_<model>, <app>.update_<model> and <app>.delete_<model>.

For demonstration purposes I will start with these to show how the admin behaves and then show how to implement an action that’s only executed if the user has the right permission.

The scenario

Lets image we have a “store” with some items being sold and it also has a blog to announce new products and promotions. Here’s what the admin looks like for the “superuser”:

Admin view, with all models being displayed.
The admin showing all the available models

We have several models for the described functionality and on the right you can see that I added a test user. At the start, this test user is just marked as regular “staff” (is_staff=True), without any permissions. For him the admin looks like this:

A view of Django admin without any model listed.
No permissions

After logging in, he can’t do anything. The store manager needs the test user to be able to view and edit articles on their blog. Since we expect in the future that multiple users will be able to do this, instead of assigning these permissions directly, lets create a group called “editors” and assign those permissions to that group.

Only two permissions for this group of users

Afterwards we also add the test user to that group (in the user details page). Then when he checks the admin he can see and edit the articles as desired, but not add or delete them.

Screenshot of the Django admin, from the perspective of a user with only "view" and "change" permissions.
No “Add” button there

The actions

Down the line, the test user starts doing other kinds of tasks, one of them being “reviewing the orders and then, if everything is correct, mark them as ready for shipment”. In this case, we don’t want him to be able to edit the order details or change anything else, so the existing “update” permissions cannot be used.

What we need now is to create a custom admin action and a new permission that would let specific users (or groups) execute that action. Lets start with the later:

class Order(models.Model):
    ...
    class Meta:
        ...
        permissions = [("set_order_ready", "Can mark the order as ready for shipment")]

What we are doing above, is telling Django there is one more permission that should be created for this model, a permission that we will use ourselves.

Once this is done (you need to run manage.py migrate), we can now create the action and ensure we check that the user executing it has the newly created permission:

class OrderAdmin(admin.ModelAdmin):
    ...
    actions = ["mark_as_ready"]

    def mark_as_ready(self, request, queryset):
        if request.user.has_perm("shop.set_order_ready"):
            queryset.update(ready=True)
            self.message_user(
                request, "Selected orders marked as ready", messages.SUCCESS
            )
        else:
            self.message_user(
                request, "You are not allowed to execute this action", messages.ERROR
            )

    mark_as_ready.short_description = "Mark selected orders as ready"

As you can see, we first check the user as the right permission, using has_perm and the newly defined permission name before proceeding with the changes.

And boom .. now we have this new feature that only lets certain users mark the orders as ready for shipment. If we try to execute this action with the test user (that does not have yet the required permission):

No permission assigned, no action for you sir

Finally we just add the permission to the user and it’s done. For today this is it, I hope you find it useful.

Categories
Software Development Technology and Internet

Mirroring GitHub Repositories

Git by itself is a distributed version control system (a very popular one), but over the years organizations started to rely on some internet services to manage their repositories and those services eventually become the central/single source of truth for their code.

The most well known service out there is GitHub (now owned by Microsoft), which nowadays is synonymous of git for a huge amount of people. Many other services exist, such as Gitlab and BitBucked, but GitHub gained a notoriety above all others, specially for hosting small (and some large) open source projects.

These centralized services provide many more features that help managing, testing and deploying software. Functionality not directly related to the main purpose of git.

Relying on these central services is very useful but as everything in life, it is a trade-off. Many large open source organizations don’t rely on these companies (such as KDE, Gnome, Debian, etc), because the risks involved are not worth the convenience of letting these platforms host their code and other data.

Over time we have been witnessing some of these risks, such as your project (and all the related data) being taken down without you having any chance to defend yourself (Example 1 and Example 2). Very similar to what some content creators have been experiencing with Youtube (I really like this one).

When this happens, your or your organizations don’t lose the code itself since you almost certainly have copies on your own devices (thanks to git), but you lose everything else, issues, projects, automated actions, documentation and essentially the known used by URL of your project.

Since Github is just too convenient to collaborate with other people, we can’t just leave. In this post I explain an easy alternative to minimize the risks described above, that I implemented myself after reading many guides and tools made by others that also tried to address this problem before.

The main idea is to automatically mirror everything in a machine that I own and make it publicly available side by side with the GitHub URLs, the work will still be done in Github but can be easily switched over if something happens.

The software

To achieve the desired outcome I’ve researched a few tools and the one that seemed to fit all my requirements (work with git and be lightweight) was “Gitea“. Next I will describe the steps I took.

The Setup

This part was very simple, I just followed the instructions present on the documentation for a docker based install. Something like this:

version: "3"

networks:
  gitea:
    external: false

services:
  server:
    image: gitea/gitea:latest
    container_name: gitea
    environment:
      - USER_UID=1000
      - USER_GID=1000
    restart: always
    networks:
      - gitea
    volumes:
      - ./gitea:/data
      - /etc/timezone:/etc/timezone:ro
      - /etc/localtime:/etc/localtime:ro
    ports:
      - "3000:3000"
      - "222:22"

If you are doing the same, don’t copy the snippet above. Take look here for updated instructions.

Since my website is not supposed to have much concurrent activity, using an SQLite database is more than enough. So after launching the container, I chose this database type and made sure I disabled the all the functionality I won’t need.

Part of Gitea's configuration page. Server and Third-Party Service Settings.
Part of the Gitea’s configuration page

After this step, you should be logged in as an admin. The next step is to create a new migration on the top right menu. We just need to choose the “Github” option and continue. You should see the below screen:

Screenshot of the page that lets the users create a new migration/mirror in Gitea.
Creating a new Github migration/mirror in Gitea.

If you choose This repository will be a mirror option, Gitea will keep your repository and wiki in sync with the original, but unfortunately it will not do the same for issues, labels, milestones and releases. So if you need that information, the best approach is to uncheck this field and do a normal migration. To keep that information updated you will have to repeat this process periodically.

Once migrated, do the same for your other repositories.

Conclusion

Having an alternative with a backup of the general Github data ended up being quite easy to set up. However the mirror feature would be much more valuable if it included the other items available on the standard migration.

During my research for solutions, I found Fossil, which looks very interesting and something that I would like to explore in the future, but at the moment all repositories are based on Git and for practical reasons that won’t change for the time being.

With this change, my public repositories can be found in:

Categories
Python

Django Friday Tips: Inspecting ORM queries

Today lets look at the tools Django provides out of the box to debug the queries made to the database using the ORM.

This isn’t an uncommon task. Almost everyone who works on a non-trivial Django application faces situations where the ORM does not return the correct data or a particular operation as taking too long.

The best way to understand what is happening behind the scenes when you build database queries using your defined models, managers and querysets, is to look at the resulting SQL.

The standard way of doing this is to set the logging configuration to print all queries done by the ORM to the console. This way when you browse your website you can check them in real time. Here is an example config:

LOGGING = {
    ...
    'handlers': {
        'console': {
            'level': 'DEBUG',
            'filters': ['require_debug_true'],
            'class': 'logging.StreamHandler',
        },
        ...
    },
    'loggers': {
        ...
        'django.db.backends': {
            'level': 'DEBUG',
            'handlers': ['console', ],
        },
    }
}

The result will be something like this:

...
web_1     | (0.001) SELECT MAX("axes_accessattempt"."failures_since_start") AS "failures_since_start__max" FROM "axes_accessattempt" WHERE ("axes_accessattempt"."ip_address" = '172.18.0.1'::inet AND "axes_accessattempt"."attempt_time" >= '2020-09-18T17:43:19.844650+00:00'::timestamptz); args=(Inet('172.18.0.1'), datetime.datetime(2020, 9, 18, 17, 43, 19, 844650, tzinfo=<UTC>))
web_1     | (0.001) SELECT MAX("axes_accessattempt"."failures_since_start") AS "failures_since_start__max" FROM "axes_accessattempt" WHERE ("axes_accessattempt"."ip_address" = '172.18.0.1'::inet AND "axes_accessattempt"."attempt_time" >= '2020-09-18T17:43:19.844650+00:00'::timestamptz); args=(Inet('172.18.0.1'), datetime.datetime(2020, 9, 18, 17, 43, 19, 844650, tzinfo=<UTC>))
web_1     | Bad Request: /users/login/
web_1     | [18/Sep/2020 18:43:20] "POST /users/login/ HTTP/1.1" 400 2687

Note: The console output will get a bit noisy

Now lets suppose this logging config is turned off by default (for example, in a staging server). You are manually debugging your app using the Django shell and doing some queries to inspect the resulting data. In this case str(queryset.query) is very helpful to check if the query you have built is the one you intended to. Here’s an example:

>>> box_qs = Box.objects.filter(expires_at__gt=timezone.now()).exclude(owner_id=10)
>>> str(box_qs.query)
'SELECT "boxes_box"."id", "boxes_box"."name", "boxes_box"."description", "boxes_box"."uuid", "boxes_box"."owner_id", "boxes_box"."created_at", "boxes_box"."updated_at", "boxes_box"."expires_at", "boxes_box"."status", "boxes_box"."max_messages", "boxes_box"."last_sent_at" FROM "boxes_box" WHERE ("boxes_box"."expires_at" > 2020-09-18 18:06:25.535802+00:00 AND NOT ("boxes_box"."owner_id" = 10))'

If the problem is related to performance, you can check the query plan to see if it hits the right indexes using the .explain() method, like you would normally do in SQL.

>>> print(box_qs.explain(verbose=True))
Seq Scan on public.boxes_box  (cost=0.00..13.00 rows=66 width=370)
  Output: id, name, description, uuid, owner_id, created_at, updated_at, expires_at, status, max_messages, last_sent_at
  Filter: ((boxes_box.expires_at > '2020-09-18 18:06:25.535802+00'::timestamp with time zone) AND (boxes_box.owner_id <> 10))

This is it, I hope you find it useful.

Categories
Personal

The app I’ve used for the longest period of time

What is the piece of software (app) you have used continuously for the longest period of time?

This is an interesting question. More than 2 decades have passed since I’ve got my first computer. Throughout all this time my usage of computers evolved dramatically, most of the software I installed at the time no longer exists or is so outdated that there no point in using it.

Even the “type” of software changed, before I didn’t rely on so many web apps and SaaS (Software as a service) products that dominate the market nowadays.

The devices we use to run the software also changed, now it’s common for people to spend more time on certain mobile apps than their desktop counterparts.

In the last 2 decades, not just the user needs changed but also the communication protocols in the internet, the multimedia codecs and the main “algorithms” for certain tasks.

It is true that many things changed, however others haven’t. There are apps that were relevant at the time, that are still in use and I expect that they will still be around in for many years.

I spent some time thinking about my answer to the question, given I have a few strong contenders.

One of them is Firefox. However my usage of the browser was split by periods when I tried other alternatives. I installed it when it was initially launched and I still use it nowadays, but the continuous usage time doesn’t take it to the first place.

I used Windows for 12/13 straight years before switching to Linux, but it is still not enough (I also don’t think operating systems should be taken into account for this question, since for most people the answer would be Windows).

VLC is another contender, but like it happened to Firefox, I started using it early and then kept switching back and forth with other media players throughout the years. The same applies to the “office” suite.

The final answer seems to be Thunderbird. I’ve been using it daily since 2004, which means 16 years and counting. At the time I was fighting the ridiculously small storage limit I had for my “webmail” inbox, so I started using it to download the messages to my computer in order to save space. I still use it today for totally different reasons.

And you, what is the piece of software or app you have continuously used for the longest period of time?