The Over-Engineering of this Website

1. Introduction

To me it always seemed like making your own website was something of a rite of passage for programmers. I’ve also talked to people who don’t care for it, and I understand why they feel that way.

Making a website has never been easier. Not only is there a seemingly unending list of NPM packages and tools that help to make website design easy, but there are also GUI tools like Squarespace which allow even those with no programming experience to create websites. This is decidedly a good thing as having a functional and appealing website has become a necessity for businesses, but it takes some of the fun out of it.

So when I decided to create this website for the purposes of self promotion, I decided it would also have to be a learning process. This is not your everyday, cookie-cutter, first website. But it also is not insanely complex to the point that it gets in its own way. One of my few guiding principles for this site was to find the perfect level of over-engineering, and I believe I have done that.

One more note before we get into the design process behind this site: I hesitate to link to tutorials and guides as I know how frustrating it can be to follow an outdated tutorial only to realize the underlying API has changed. Just know that many great resources on these topics exist, and modern search engines are the programmer's best friend. Part of learning how to program is learning how to google things effectively.

2. The Tools

Source: TunePad

I’ve recently been contributing to TunePad, a website which teaches students how to program by enabling them to write music with code. Because of these contributions, I got to see TunePad's internals, and I was surprised to discover that its backend was written in Python.

Up to this point, I had only really had exposure to Node.js backends. After hearing developer after developer tout the benefits of using Node.js and simplifying the process of building a website to a one-language problem, JavaScript just seemed like the obvious winner.

There are, however, benefits to using Python which may not be readily obvious. JavaScript is a great language, don’t get me wrong, but there have been times when its C-style syntax has seemed clunky to me, and I think the assertion that Python is more readable is fair.

Python has also found something of a niche in fields like data science, machine learning, etc. Being able to use packages like PyTorch, Keras, or Pandas when developing your website is a major benefit, though to be fair it isn’t like JavaScript doesn’t have many excellent packages which would get the job done.

For me, one of the biggest benefits was that I was already way more familiar with Python than with JavaScript, and I’m sure this is true for many people my age. Python is a really friendly language for beginners, but it’s also incredibly capable and is an industry standard in fields like data science and machine learning. As the backend can be more technically challenging than the frontend (note that I did not say more difficult), it’s nice to face it alongside a friend rather than an acquaintance.

For the database, I went with Postgres and SQLAlchemy because I was already familiar with them, and they are known to be reliable. Those of you familiar with the workflow I have described so far may be wondering why I'm not using Flask-SQLAlchemy, and the answer to that question can be found in this lovely article by Edward Krueger.

I chose to use Dart and Sass for the frontend for the same reason (familiarity), but also because they intrigue me. As programmers, we have to deal with many layers of abstraction, or as is often the case, they are dealt with for us. At some point, any code we write has to be compiled to machine code before it can be run, but we rarely ever get to see that, and there are usually even more layers in between.

It’s cool to be able to write code in Dart and then watch it get compiled to JavaScript. Or to write one line of SCSS and watch it turn into three. The efficiency bonus of having to write fewer lines of code or CSS is also great, and learning a new tool has always been something I have enjoyed.

Another notable package I used was the Python JWT package, which allows for the easy creation and verification/decoding of JSON Web Tokens. For the uninitiated these are cryptographically signed messages which cannot easily be tampered with or faked.
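To make the idea of a signed message a little more concrete, here is a minimal sketch that uses only Python's standard library rather than the JWT package itself. The names SECRET, sign_message, and verify_message are made up for illustration, and the token format is deliberately simplified; it is not a real JWT.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"change-me"  # hypothetical server-side secret, never sent to clients


def sign_message(payload: dict) -> str:
    """Serialize the payload and append an HMAC-SHA256 signature."""
    body = base64.urlsafe_b64encode(json.dumps(payload, sort_keys=True).encode())
    signature = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + signature


def verify_message(token: str):
    """Return the payload if the signature is valid, otherwise None."""
    body, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return None  # the body or the signature was altered
    return json.loads(base64.urlsafe_b64decode(body.encode()))
```

Anyone can read the payload, but without the secret they cannot produce a signature that the server will accept for a modified one.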

3. Setup

First I generated a new project template, which I did using Dart's Stagehand package. Next I initialized a git repo in the new project folder. My IDE of choice in this case was Microsoft's VSCode, so I started that up. I created an install script for all of the tools I would be using and ran it.

The last step in the setup process was creating a simple Flask app file in Python and running the three continuous build dev tools I'd need: Flask, Dart build_runner, and Sass.

The benefit of using these tools is that whenever I make a change to something, they automatically build almost instantly so I can test even small changes hassle free.

4. Putting it All Together

Now for a bit of website anatomy. When a user accesses your webpage the client sends what's called an HTTP request, which is essentially a way to send information to the server hosting the site and receive a response. Most of the time these come in the form of GET requests. When you load a page you "GET" the HTML file and its accompanying CSS and JS files, and then your browser uses them to render the page.

With Flask you can easily define functions which will be run whenever a user makes a specific type of request on a specific URL. For example, you might want to show your website's homepage whenever a user makes a GET request on the "/" or "/home/" URLs.

@app.route("/", methods=["GET"])
@app.route("/home/", methods=["GET"])
def homepage():
    return render_template("homepage.html")

In this snippet, the function homepage() will be called whenever the server receives a "GET" request on either "/" or "/home/" and the function returns the homepage HTML file after it has been rendered by the templating engine.

Flask also makes use of the Jinja2 templating engine, which allows you to create HTML templates that can then be filled in with information, meaning you don't have to copy and paste a bunch just to have your navigation bar appear on every page. There are tons of other applications for templates, so I recommend checking the documentation linked above for more details. Then when your templates are finished, you can use the render_template() function to create a final HTML file which gets sent to the user.
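The navigation bar case can be sketched with Jinja2 on its own. The template names and contents below are assumptions made for illustration, and they live in a dictionary only so the example is self-contained; in a Flask project they would be files in the templates folder.

```python
from jinja2 import DictLoader, Environment

# Hypothetical templates; "base.html" holds the shared navigation bar.
templates = {
    "base.html": (
        "<nav>Home | Blog</nav>\n"
        "<main>{% block content %}{% endblock %}</main>"
    ),
    "homepage.html": (
        '{% extends "base.html" %}'
        "{% block content %}<h1>{{ title }}</h1>{% endblock %}"
    ),
}

env = Environment(loader=DictLoader(templates), autoescape=True)

# The child template only fills in the "content" block; the nav bar is inherited.
html = env.get_template("homepage.html").render(title="Welcome")
```

Every page that extends the base template picks up the navigation bar, so it only has to be written once.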

4.1 Dart and SCSS

The next step is setting up a static folder with Flask during app creation so it knows where to get CSS content from, and setting up Sass to compile to the same folder. Then you can use the url_for function in your templates to easily include stylesheets. This is how I initialized my app object:

app = Flask(__name__, static_folder="../assets", template_folder="../templates")

Getting Dart (or JavaScript) working is a similar and yet altogether different process. First you have to tell the build_runner where to output to, which for me is "/build/web/dart/" (also "/build/packages/") in my main project folder. I've decided to include the relevant Python code here because it also illustrates how to create more dynamic URLs:

@app.route("/dart/<path:file_name>", methods=["GET"])
def dart_static(file_name):
    return send_from_directory("../build/web/dart/", file_name)

@app.route("/packages/<path:path>", methods=["GET"])
def dart_packages(path):
    return send_from_directory("../build/packages/", path)

As you can probably tell, file_name is a parameter which is extracted from the URL and passed to the dart_static() function, which then can retrieve and deliver the correct dart file to the browser/client. The same can be said for path and dart_packages().

4.2 The Database and the API

Now comes the fun part! For database access, I used SQLAlchemy, which made connecting to the database via the engine-session model really easy. You have to tell SQLAlchemy what to expect from the database by creating model classes which inherit from the base class returned by SQLAlchemy's declarative_base() function. You can then use these classes to query the database or to create and add new rows through the session object. Just remember that if you make any changes to the database, you also need to commit them!
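Here is a minimal sketch of that engine-session workflow, assuming SQLAlchemy 1.4 or newer. The Post model, its table name, and its columns are hypothetical, and an in-memory SQLite database stands in for Postgres so the example runs without a server.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import declarative_base, sessionmaker

Base = declarative_base()


class Post(Base):
    """Hypothetical model; the table and column names are assumptions."""
    __tablename__ = "posts"
    id = Column(Integer, primary_key=True)
    title = Column(String, nullable=False)


# In-memory SQLite keeps the sketch self-contained; swap in a Postgres URL for real use.
engine = create_engine("sqlite:///:memory:")
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)

session = Session()
session.add(Post(title="Hello, world"))
session.commit()  # without this, the new row is never persisted

titles = [post.title for post in session.query(Post).all()]
session.close()
```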

But if the user wants to access information stored in the database, shouldn't we also set up a connection through Dart? The answer is no. If we allowed the client to directly access the database, it would mean sending them login credentials which could then be used for malicious purposes. So how do we serve them content and allow them to make changes?

This problem is most elegantly solved via an "/api/" family of URLs. The browser makes a GET or POST request to these URLs which then can send information back. Let's look at this in the context of uploading a new blog post. We don't want the user to connect directly to the database, so instead we have them send their file through a POST request to something like "/api/blog/upload" where the server receives it and, if the user who sent it is logged in, adds the post and other relevant information about it to the database so it can then be seen by other users of the site.

[Figure: API diagram]

As for how we know the user is logged in, that will be explained in a later section. The idea here is that the client can still send and receive data to/from the database, it just has to be through a public-facing API URL. Another cool bonus of this approach is that if someone were so inclined, they could easily make their own third-party client for the site since the API is exposed. They would simply need to know how it worked, and many big websites have documentation on their APIs, though not all of them publish such material.
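The upload example might look roughly like the sketch below. It is only an illustration: the posts list stands in for the database, and is_logged_in() with its X-Demo-Auth header is a made-up placeholder for real authentication, not how this site does it.

```python
import io

from flask import Flask, jsonify, request

app = Flask(__name__)
posts = []  # stand-in for the database


def is_logged_in(req) -> bool:
    """Hypothetical check; a placeholder for real token validation."""
    return req.headers.get("X-Demo-Auth") == "let-me-in"


@app.route("/api/blog/upload", methods=["POST"])
def upload_post():
    if not is_logged_in(request):
        return jsonify(error="not logged in"), 401
    uploaded = request.files.get("file")
    if uploaded is None:
        return jsonify(error="no file provided"), 400
    # The server, not the client, decides what ends up in storage.
    posts.append({"name": uploaded.filename, "body": uploaded.read().decode("utf-8")})
    return jsonify(status="ok", count=len(posts)), 201


# Exercising the endpoint with Flask's built-in test client:
client = app.test_client()
response = client.post(
    "/api/blog/upload",
    data={"file": (io.BytesIO(b"# My first post"), "post.md")},
    headers={"X-Demo-Auth": "let-me-in"},
    content_type="multipart/form-data",
)
```

The client never sees database credentials; it only ever talks to the URL.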

5. Content Delivery

When I first told one of my friends that I wanted to create my own website, he recommended that I create something statically-hosted.

Imagine a website which always displays the same content until the creator changes the actual HTML of the site on the server. There’s no database behind a static website, all that matters are the HTML files and any CSS or JavaScript that gets included. For something like a blog, this makes tons of sense; static hosting is easier, and while it would be difficult to generate a user-friendly content upload system around a static website, I would be the sole user so it wouldn’t matter.

So why didn’t I take his advice? I wouldn’t have learned as much. Plus this would have been my blog post about a static website:

I wrote some HTML. Then I wrote some CSS to make the HTML look less bad. I eventually gave up on centering things. Then I used JavaScript to make things look better, like animations and stuff. Then I wrote some blog posts in HTML, and they all lived happily ever after!

See? Not as fun.

Instead, I decided to create an upload system through which a user could upload a markdown file which would then be saved to the database and later served to the user.

Whenever I tackle a complicated problem like this one, I work through a series of smaller proof-of-concepts. The first proof of concept for my content upload system was to see if I could upload a file with Dart. This can be done through the MultipartRequest class and a simple HTML form.

Unfortunately, the way I had been handling authentication thus far (which I'll get to later) did not allow me to easily send multipart requests. I spent the better part of a day rewriting my client authentication functions, but the result was actually a system which was much less convoluted than the previous one.

As I was working on the upload system, I also started writing this blog post, in part to see what other features I might need. For a while everything was going smoothly until I drew out the diagram you see above and realized that I might also want to include images in my posts.

Now the scenario changed from a single file upload to a multiple file upload, meaning something like a zip file would be necessary in order to avoid folder and multi-file uploading. Thankfully I quickly found the built-in Python zipfile module, so I knew this wouldn't be a problem.

The use case I envisioned had users (or just me, really) creating a new folder, creating a markdown file inside of it, and including image files either in the same directory or subdirectories of the same directory. If the directory containing the markdown file was compressed into a zip file, then all of the relevant images would also be compressed and sent along with it as well.
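As a rough sketch of the server side of that use case, the zipfile module can read every member of an uploaded archive straight from bytes. The read_upload() helper and the file names are assumptions for illustration; what happens to the contents afterwards is left to the caller.

```python
import io
import zipfile


def read_upload(zip_bytes: bytes) -> dict:
    """Return {path inside archive: file contents} for every file in the zip."""
    files = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # directory entries carry no content
            files[info.filename] = archive.read(info)
    return files


# Build a small archive in memory to stand in for a real upload.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("post.md", "![a trout](images/trout.png)")
    archive.writestr("images/trout.png", b"\x89PNG placeholder bytes")

uploaded = read_upload(buffer.getvalue())
```

The keys of the returned dictionary are the paths as stored in the archive, which is what the relative paths in the markdown file have to be matched against.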

So far so good, right? This process is simple enough for users to understand, and the server can just extract all of the images out of the zip file! We'll just store all of the files in the server's filesystem so all of the relative paths expressed in the markdown file aren't affected...

No! Bad! That's what we have a database for! Storing user generated content locally in the filesystem is dangerous, at least far more dangerous than storing it in the database.

The hard part wasn't storing the images in the database, that was quite easy. Nor was it hard to deliver those images from the database to the client. No, the reason this system took a whole three days to create was that if I wanted my URLs to make sense, I would have to replace the relative paths expressed in the markdown file with URL stubs like "/resource/0" or something similar.

File paths are like strings fixed between two points; they have to go from point A to point B, but anything can happen in the middle. Let me show you what I mean. All of the following paths could be equivalent (in terms of accuracy in reaching the file, obviously some of these paths are of higher quality than others):

images/photos/fish/trout.png (the good example)
images/../images/photos/fish/../fish/trout.png
images\.\photos/./././/fish/\/..\fish/trout.png
not_images\not_photos\random_folder\..\..\..\images\photos\fish\trout.png
images/photos/dogs/havanese/../../fish/./../fish/fresh_water/../trout.png

Are users likely to write paths like these in their markdown files? No. Would I, currently the sole user of my website, ever write a file path like this? Absolutely not. So why did this problem bother me so much? Why did I spend so much time on it?

Because it's an interesting problem, and a deceptively simple one at that. How can you tell when two links are equivalent? And better yet, how do you search a file for links that are equivalent? How do you handle the case of two or more identically named sub-directories which are themselves contained in different subdirectories?

Another reason I took this problem on was to motivate myself to learn about regular expressions, something which had long been on my to-do list but which had never been a priority before now.
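For the narrower question of whether two paths point at the same file, the standard library can already answer without any regular expressions. The sketch below normalizes paths with posixpath.normpath; the canonical() helper is a name I made up. This is a different technique from the regex approach described next, since it compares whole paths rather than finding them inside markdown text.

```python
import posixpath


def canonical(path: str) -> str:
    """Collapse '.', '..' and repeated or mixed slashes into one canonical form."""
    return posixpath.normpath(path.replace("\\", "/"))


candidates = [
    "images/photos/fish/trout.png",
    "images\\.\\photos/./././/fish/\\/..\\fish/trout.png",
    "images/photos/dogs/havanese/../../fish/./../fish/fresh_water/../trout.png",
]

# All three spellings collapse to the same canonical path.
canonical_forms = {canonical(p) for p in candidates}
```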

So how did I do it? The regular expression ended up being really complicated, so I actually used Python code to build it. This is a good way to abstract a lot of the confusion away so you can be more sure about what you are doing. Here's an example:

slash = r"[/\\]+"

This regular expression looks for one or more forward or backward slashes, and here I've assigned its value to a string named "slash." Then whenever I need a slash in my final regular expression, I can just add the value of "slash" instead of making an already confusing string of characters more confusing.
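A quick way to see what this building block does is to split a messy path on it. The sample path is invented for the demonstration.

```python
import re

slash = r"[/\\]+"

# Each run of forward and/or backward slashes counts as a single separator.
parts = re.split(slash, r"images\\photos//fish/\trout.png")
```

Here parts comes out as the four names "images", "photos", "fish", and "trout.png", no matter how the separators were written.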

The first version I wrote ignored the parent directory symbol ("../") in favor of focusing on the current directory symbol ("./"). I created another regular expression to deal with this symbol which I called no_change:

no_change = r"(\." + slash + ")*"

This regex represents a dot followed by one or more slashes. Actually, it represents zero or more of these dot-followed-by-slash units. Now if we wanted to build a regular expression which would try to match "images/photos/fish/trout.png" we could do it like this:

no_change + "images" + slash + no_change + "photos" + slash + no_change + "fish" + slash + no_change + "trout.png"

This seems pretty repetitive, and indeed I never wrote anything like this out until now. I built the regular expressions using this function:

def buildPathRegex(comps, file_name): #comps -> components of path excluding file
    r = prefix

    #add logic
    for c in comps:
        r += no_change
        r += c
        r += slash
    r += no_change + file_name

    r += postfix
    return r

You might be wondering what prefix and postfix are. We don't want to replace just any file path in the markdown file, only those which occur inside of the image embed syntax, so the prefix and postfix regular expressions look for those to make sure we are only matching paths which are being used to embed images. That's why I can write "api.jpg" here without fear that it will be converted to something else!
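To show the whole thing end to end, here is a hedged sketch of buildPathRegex in use. The prefix and postfix patterns below are my guesses at something that anchors on markdown image syntax, not the exact ones used on this site, and the re.escape() calls are an addition so that characters like the dot in "trout.png" only match themselves.

```python
import re

slash = r"[/\\]+"
no_change = r"(\." + slash + ")*"

# Assumed stand-ins for prefix and postfix: they anchor the match inside
# markdown image syntax, ![alt text](path).
prefix = r"(?P<open>!\[[^\]]*\]\()"
postfix = r"(?P<close>\))"


def build_path_regex(comps, file_name):
    """Same shape as buildPathRegex above, with re.escape added for safety."""
    r = prefix
    for c in comps:
        r += no_change + re.escape(c) + slash
    r += no_change + re.escape(file_name)
    return r + postfix


pattern = build_path_regex(["images", "photos", "fish"], "trout.png")
markdown = "See images/photos/fish/trout.png: ![a trout](./images/.//photos/fish/trout.png)"

# Only the path inside the image embed is swapped for the URL stub.
rewritten = re.sub(pattern, r"\g<open>/resource/0\g<close>", markdown)
```

The bare path in the sentence is left alone, while the embedded one becomes "/resource/0".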

Matching paths which do contain parent directory symbols was much, much harder, in part because regular expressions cannot really match something like this. In order to completely match a path which could contain parent directory symbols, you would need to have recursive regular expressions, and Python's built-in re module does not support recursion.

So, I did the next best thing. I built the regular expression with copy-and-paste style recursion (through the power of functions), with a maximum recursive depth equal to the number of directories in the longest path found in the zip file.

What? In this case it might be easier to show you first:

def buildExitRegex(max_depth):
    #the dots in ".." are escaped so they only match literal dots
    if max_depth == 0:
        return r"[^\"*/:<>?\\|.]+" + slash + no_change + r"\.\." + slash + no_change
    elif max_depth > 0:
        return r"[^\"*/:<>?\\|.]+" + slash + no_change + "(" + buildExitRegex(max_depth - 1) + ")*" + no_change + r"\.\." + slash + no_change

def buildPathRegexR(comps, file_name, max_depth):
    exitR = no_change + r"(" + buildExitRegex(max_depth) + r")*"
    r = prefix + exitR

    #add logic
    for c in comps:
        r += c
        r += slash
        r += exitR
    r += exitR + file_name

    r += postfix
    return r

The buildExitRegex() function creates a regular expression which looks for a path that essentially ends where it started. Paths which use the parent directory symbol will essentially follow this pattern (what happens in between each step is arbitrary so we would also need to include no_change):

in out
in in out out
in in in out out out
in in in in out out out out

You might also see things like:

in out in out
in in out in out out

This is to say that the pattern we are looking for is one which starts with an "in," ends with an "out," and has zero or more things like itself in the middle. If you look at buildExitRegex, that's exactly the behavior it describes, though no one would fault you if you are still confused.
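If a demonstration helps, the sketch below rebuilds the same pattern in a slightly condensed form and tries it on a few "in/out" sequences. The function name build_exit_regex and the single-letter directory names are mine, and the dots in ".." are written escaped so that they only match literal dots.

```python
import re

slash = r"[/\\]+"
no_change = r"(\." + slash + ")*"
name = r"[^\"*/:<>?\\|.]+"  # one directory name, as in buildExitRegex


def build_exit_regex(max_depth):
    """One 'in', up to max_depth levels of nested in/out pairs, then one 'out'."""
    inner = ""
    if max_depth > 0:
        inner = "(" + build_exit_regex(max_depth - 1) + ")*" + no_change
    return name + slash + no_change + inner + r"\.\." + slash + no_change


exit_pattern = re.compile("(" + build_exit_regex(2) + ")*")

balanced = bool(exit_pattern.fullmatch("a/b/../c/../../"))       # in in out in out out
unbalanced = bool(exit_pattern.fullmatch("a/b/../"))             # in in out
too_deep = bool(exit_pattern.fullmatch("a/b/c/d/../../../../"))  # four levels of nesting
```

With a maximum depth of 2, only the first of these matches: the second never returns to where it started, and the third nests deeper than the pattern was built to handle.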

This solved the main problem, but I also had to write a few smaller regular expressions to remove HTML comments and command characters like delete or backspace characters.
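Those smaller clean-up expressions could look something like the sketch below. These are not the exact patterns used on this site; the names and the choice of which control characters to strip are assumptions.

```python
import re

html_comment = re.compile(r"<!--.*?-->", re.DOTALL)
# C0 control characters except tab, newline and carriage return, plus DEL.
control_chars = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def clean_markdown(text: str) -> str:
    """Drop HTML comments first, then any stray control characters."""
    return control_chars.sub("", html_comment.sub("", text))


cleaned = clean_markdown("Hello<!-- hidden\nnote --> wor\x08ld\x7f!\n")
```

The DOTALL flag lets a comment span several lines, and the non-greedy `.*?` stops at the first closing marker instead of swallowing everything up to the last one.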

Then it was just a matter of storing the content in the database and writing a couple of routes to serve it back to the user when necessary. For me this exploration of regular expressions was a lot of fun, and I now understand not just how to use the tool, but also some of its limitations.

Truth be told there were a few other problems related to displaying the markdown, but they weren't nearly as interesting as this one was.

6. AUTH

And how will I upload those markdown files? How do I manage my blog posts? I log in of course! You may have even noticed the login button up above. Creating a robust login system for a single-user site may seem like overkill, but hear me out. This website does not matter. Well, maybe it matters to me a bit.

What does matter however is that I learn something from this, and authentication systems caught my eye. This site uses the refresh token and access token model which many sites use, except I built this one myself. Here’s how it works.

1. The user tries to access protected content.
2. If the user has an access token, then they can see the content.
3. Else the user is directed to get an access token using their refresh token.
4. If the user has a refresh token, then they can get an access token. GOTO 2.
5. Else the user is directed to get a refresh token by logging in. GOTO 4.
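The flow above can be sketched as a small loop. The state dictionary is a hypothetical stand-in for the browser's cookies, and the booleans stand in for real tokens; nothing here validates anything.

```python
def access_protected(state: dict) -> list:
    """Walk the flow above, recording each action until content is shown."""
    actions = []
    while not state.get("access_token"):
        if state.get("refresh_token"):
            # Trade the refresh token for a fresh access token.
            state["access_token"] = True
            actions.append("refresh")
        else:
            # No tokens at all: the user has to log in to get a refresh token.
            state["refresh_token"] = True
            actions.append("login")
    actions.append("show content")
    return actions
```

A brand new visitor goes through "login", then "refresh", then "show content", while someone who already holds a valid access token goes straight to the content.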

Access tokens only last a short amount of time while refresh tokens usually last longer. So even though the access tokens are sent to the server more frequently, and thus are more exposed to attacks, they are not as dangerous because they will expire much sooner. You can cause a lot of damage in 30 minutes, but not as much as you could cause in seven days.

Access tokens are not tracked by the server either, so it’s less work. I’ve used JWTs, which I mentioned earlier. It would be very hard for a malicious attacker to convince the server that a fake access token was real, so the server can trust that if it can validate a token, at some point it created that token.

Refresh tokens, in contrast, are tracked in the database, so if you want to revoke a user’s refresh token and prevent them from getting more access tokens, you can do that.

The tokens are stored in cookies which are HTTP only, secure, and locked to the paths "/api/protected/" and "/api/refresh_token_send/" for access tokens and refresh tokens respectively. This last part means that they won’t be exposed unless they are needed to either access/post content, or to get a new access token.
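Here is a sketch of what those cookie attributes look like when built with Python's standard library. The cookie names, token values, and lifetimes are placeholders I picked for the example; only the two paths come from this site.

```python
from http.cookies import SimpleCookie

cookie = SimpleCookie()

cookie["access_token"] = "hypothetical-access-token"
cookie["access_token"]["path"] = "/api/protected/"
cookie["access_token"]["max-age"] = 30 * 60            # 30 minutes, as an example

cookie["refresh_token"] = "hypothetical-refresh-token"
cookie["refresh_token"]["path"] = "/api/refresh_token_send/"
cookie["refresh_token"]["max-age"] = 7 * 24 * 60 * 60  # seven days, as an example

for morsel in cookie.values():
    morsel["httponly"] = True  # not readable from client-side scripts
    morsel["secure"] = True    # only sent over HTTPS

# One Set-Cookie header value per token.
set_cookie_headers = [morsel.OutputString() for morsel in cookie.values()]
```

Because each cookie is scoped to its own path, the browser only attaches the refresh token to requests for the refresh URL, and the access token only to requests under the protected API.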

This system is not perfect, but it was originally designed for efficiency and security, and for my purposes it is overkill. I enjoyed learning about all of the different web security standards, and maybe someday I will use the code I have written now to create a much more advanced system.

7. Conclusion

If you've made it this far, then my over-engineering is working perfectly! Unfortunately I can't really let you test the login or content upload systems since I don't exactly want a bunch of rogue users running around on my site.

You can however check out the git repo if you want to see all of the code. There's probably a lot that could be done to improve it, but code doesn't have to be perfect to work, and that wasn't the point anyway.

Obviously over-engineering is a bad habit to fall into in the real world where you are expected to actually meet deadlines and deploy apps, but as a student about to enter college, I have the luxury of time on my side. I can afford to spend hours and hours learning how to write my own code instead of just using the ready-made solutions of others.

There are still a ton of features I am looking forward to adding to this site in the future, but for now it is pretty functional from the content consumer's standpoint. Some of the producer-oriented interfaces like the login and upload screens still leave something to be desired. The home page is just a picture of me hiding behind my laptop!

I've learned through doing this that I would make a much better backend developer than a frontend developer, though neither is really where my passion lies. This was just a fun project, really a series of mini-projects, which allowed me to explore topics I was interested in. And for about three weeks of work, it turned out okay!

I hope you have enjoyed it. Sadly I have not yet built a comment section, but if you're here on my site there's a good chance I directed you to it in the first place, so you should know how to contact me if you have any questions or comments. My email is also in the Links section. Thank you for reading!