TOPIC: WEB TECHNOLOGY
Ensuring that website updates make it through every cache layer and onto the web
How Things Used to Be Simple
There was a time when life was much simpler when it came to developing, delivering and maintaining websites. Essentially, seeing your efforts online was a matter of storing or updating your files on a web server, and a hard refresh in the browser would render the updates for you. Now we have added caches here, there and everywhere in the name of making everything load faster, at the cost of added complexity.
Today, these caches are found in the application layer and at the server level, and we have added content delivery network (CDN) systems too. When trying to see a change that you made, you need to flush the lot, especially when you have been a good citizen and set long persistence times for files that should not change so often. For example, a typical Apache configuration might look like this:
<IfModule mod_expires.c>
    # Enable expiries
    ExpiresActive On
    # Default directive
    ExpiresDefault "access plus 1 month"
    # My favicon
    ExpiresByType image/x-icon "access plus 1 year"
    # Images
    ExpiresByType image/gif "access plus 1 month"
    ExpiresByType image/png "access plus 1 month"
    ExpiresByType image/jpg "access plus 1 month"
    ExpiresByType image/jpeg "access plus 1 month"
    # CSS
    ExpiresByType text/css "access plus 1 month"
    # JavaScript
    ExpiresByType application/javascript "access plus 1 year"
</IfModule>
These settings tell browsers to keep CSS files for a month and JavaScript for a year. This is excellent for performance, but when you update one of these files, you need to override these instructions at every layer. Note that this configuration only controls what your web server tells browsers. The application layer and CDN have their own separate caching rules.
All this is a recipe for confusion when you want to see how everything looks after making a change somewhere. Then, you need a process to make things appear new again. To understand why you need to flush multiple layers, you need to see where these caches actually sit in your setup.
Understanding the Pipeline
Your files travel through several systems before anyone sees them, with each system storing a copy to avoid repeatedly fetching the same file. The pipeline often looks like this:
| Layer | Examples | Actions |
|---|---|---|
| Your application | Hugo, Grav or WordPress | Reads the static files and generates HTML pages |
| Your web server | Nginx or Apache | Delivers these pages and files |
| Your CDN | Cloudflare | Distributes copies globally |
| Browsers | Chrome, Firefox, Safari | Receive and display the content |
Anything generated dynamically, for example by PHP, can flow through this pipeline freshly on every request. Someone asks for a page, your application generates it, the web server sends it, the CDN passes it along, and the browser displays it. The information flow is more immediate.
Static components like CSS, JavaScript and images work differently. They flow through the pipeline once, then each layer stores a copy. The next time someone requests that file, each layer serves its stored version instead of asking the previous layer. This is faster, but it means your updates do not flow through automatically. HTML itself can suffer from this staleness too, but not so much as other components, in my experience.
When you change a static file, you need to tell each layer to fetch the new version. You work through the pipeline in sequence, starting where the information originates.
Step 1: Update the Application Layer
After uploading your new static files to the server, the first system that needs updating is your application. This is where information enters the pipeline, and we consider Hugo, Grav and WordPress in turn, moving from simplest to most complex. Though these are examples, some of the considerations should be useful elsewhere as well.
Hugo
Hugo is a static site generator, which makes cache management simpler than dynamic CMS systems. When you build your site, Hugo generates all the HTML files in the public/ directory. There is no application-level cache to clear because Hugo does not run on the server. After modifying your templates or content, rebuild your site:
hugo
Then, upload the new files from public/ to your web server. Since Hugo generates static HTML, the complexity is reduced to just the web server, CDN and browser layers. The application layer refresh is simply the build step on your local machine.
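As a rough sketch of how this might look in practice (the server name and destination path below are placeholders, not a recommendation), a build-and-upload run could be as simple as:
# Rebuild the site locally
hugo
# Copy the generated files to the web server; host and path are examples only
rsync -avz --delete public/ user@example.com:/var/www/mysite/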
Grav CMS
Grav adds more complexity as it runs on the server and manages its own caching. When Grav reads your static files and combines them, it also compiles Twig templates that reference these files. Once you are in the Grav folder on your web server in an SSH session, issue this command to make it read everything fresh:
bin/grav clear-cache
Or manually:
rm -rf cache/* tmp/*
When someone next requests a page, Grav generates HTML that references your new static files. If you are actively developing a Grav theme, you can disable Twig and asset caching to avoid constantly clearing cache. Edit user/config/system.yaml:
cache:
  enabled: true
  check:
    method: file

twig:
  cache: false          # Disable Twig caching
  debug: true
  auto_reload: true

assets:
  css_pipeline: false   # Disable CSS combination
  js_pipeline: false    # Disable JS combination
While this keeps page caching on, it disables the caches that interfere with development; do not forget to turn them back on before deploying to production. Not doing so may impact website responsiveness.
WordPress
WordPress introduces the most complexity with its plugin-based architecture. Since WordPress uses plugins to build and store pages, you have to tell these plugins to rebuild so they reference your new static files. Here are some common examples that generate HTML and store it, along with how to make them refresh their cached files:
Page Cache Plugins
| Plugin | How to Clear Cache |
|---|---|
| WP Rocket | Settings > WP Rocket > Clear Cache (or use the admin bar button) |
| W3 Total Cache | Performance > Dashboard > Empty All Caches |
| WP Super Cache | Settings > WP Super Cache > Delete Cache |
| LiteSpeed Cache | LiteSpeed Cache > Dashboard > Purge All |
Redis Object Cache
Redis stores database query results, which are separate from page content. If your static file changes affect database-stored information (like theme options), tell Redis to fetch fresh data.
From the WordPress dashboard, the Redis Object Cache plugin provides Settings > Redis > Flush Cache. Alternatively, you can do likewise from the command line:
redis-cli FLUSHALL
Note this clears everything in Redis. If you are sharing a Redis instance with other applications, use the WordPress plugin interface instead.
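If flushing everything feels too heavy-handed, there are gentler options. Assuming WP-CLI is available on the server, or that the site has a Redis database to itself (the database number below is an assumption), either of these would be less disruptive:
# Flush only the WordPress object cache via WP-CLI
wp cache flush
# Or flush just the Redis database that the site uses (here assumed to be database 1)
redis-cli -n 1 FLUSHDB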
Step 2: Refresh the Web Server Cache
Now that your application is reading the new static files, the next system in the pipeline is your web server. Because it has stored copies of files it previously delivered, it has to be told to fetch fresh copies from your application.
Nginx
The first step is to find where Nginx stores files:
grep -r "cache_path" /etc/nginx/
This shows you lines like fastcgi_cache_path /var/cache/nginx/fastcgi which tell you the location. Using this information, you can remove the stored copies:
# Clear FastCGI cache
sudo rm -rf /var/cache/nginx/fastcgi/*
# Reload Nginx
sudo nginx -s reload
When Nginx receives a request, it fetches the current version from your application instead of serving its stored copy. If you are using a reverse proxy setup, you might also have a proxy cache at /var/cache/nginx/proxy/* which you can clear the same way.
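As a quick sanity check that the clearance worked (the path is the one found with the grep command above), counting the files left behind should come back with zero:
# Count the files remaining in the FastCGI cache; expect 0 straight after clearing
sudo find /var/cache/nginx/fastcgi -type f | wc -l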
Apache
Apache uses mod_cache for storing files, and the location depends on your configuration. Even so, common locations are /var/cache/apache2/ or /var/cache/httpd/. Using the following commands, you can find your cache directory:
# Ubuntu/Debian
grep -r "CacheRoot" /etc/apache2/
# CentOS/RHEL
grep -r "CacheRoot" /etc/httpd/
Armed with the paths that you have found, you can remove the stored copies:
# Ubuntu/Debian
sudo rm -rf /var/cache/apache2/mod_cache_disk/*
sudo systemctl reload apache2
# CentOS/RHEL
sudo rm -rf /var/cache/httpd/mod_cache_disk/*
sudo systemctl reload httpd
Alternatively, if you have mod_cache_disk configured with htcacheclean, you can use:
sudo htcacheclean -t -p /var/cache/apache2/mod_cache_disk/
When Apache receives a request, it fetches the current version from your application.
Step 3: Update the CDN Layer Cache Contents
Now that your web server is delivering the new static files, the next system in the pipeline is your CDN. This has stored copies at edge servers worldwide, which need to be told to fetch fresh copies from your web server. Here, Cloudflare is given as an example.
Cloudflare
First, log into the Cloudflare dashboard and navigate to Caching. Then, click "Purge Everything" and wait a few seconds. Now, when someone requests your files, Cloudflare fetches the current version from your web server instead of serving its stored copy.
If you are actively working on a file and deploying repeatedly, enable Development Mode in the Cloudflare dashboard. This tells Cloudflare to always fetch from your server rather than serving stored copies. Helpfully, it automatically turns itself off after three hours.
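If you deploy frequently, purging from the command line may suit better than the dashboard. Cloudflare offers a purge endpoint in its API; the zone ID and API token below are placeholders that you would replace with your own values:
# Purge everything for a zone via the Cloudflare API (zone ID and token are placeholders)
curl -X POST "https://api.cloudflare.com/client/v4/zones/YOUR_ZONE_ID/purge_cache" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  --data '{"purge_everything":true}'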
Step 4: Refresh What Is Loaded in Your Browser
Having dealt with everything along the pipeline so far, we finally come to the browser. This is where a hard refresh of the content is needed. Force a refresh of what is loaded in your browser using the appropriate keyboard shortcut for your system. The shortcuts vary, though holding down the Shift key and clicking the Refresh button works a lot of the time. Naturally, there are other options, and here are some suggestions:
| Operating System | Keyboard Shortcut |
|---|---|
| macOS | Command + Shift + R |
| Linux | Control + Shift + R |
| Windows | Control + F5 |
Making use of these operations ensures that your static files come through the whole way so you can see them along with anyone else who visits the website.
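One way to double-check without relying on the browser is to inspect the response headers from the command line; with Cloudflare in front, the cf-cache-status header tells you whether the edge served a stored copy or fetched a fresh one (the URL is an example):
# Fetch only the headers for a static file
curl -I https://www.example.com/css/styles.css
# Look at Last-Modified, Cache-Control and, behind Cloudflare, cf-cache-status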
Recap
Because a lot of detail has been covered on this journey, let us remind ourselves where we have been with this final run-through. Everything follows sequentially:
- Upload your changed files to the server
- Verify the files uploaded correctly:
ls -l /path/to/your/css/file.css
Check the modification time matches when you uploaded.
- Refresh your application layer:
# Grav
bin/grav clear-cache
# WordPress - via WP-CLI
wp cache flush
# Or use your caching plugin's interface
- Refresh Redis (if you use it for object caching):
redis-cli FLUSHALL
# Or via WordPress plugin interface
- Refresh your web server layer (if using Nginx or Apache):
# Nginx
sudo rm -rf /var/cache/nginx/fastcgi/*
sudo nginx -s reload
# Apache (Ubuntu/Debian)
sudo rm -rf /var/cache/apache2/mod_cache_disk/*
sudo systemctl reload apache2
- Refresh your CDN layer: Cloudflare dashboard > Purge Everything
- Perform a forced refresh in your browser using the appropriate keyboard shortcut for your system
- Test the page to confirm the update has come through fully
Any changes become visible because the new files have travelled from your application through each system in the pipeline. Sometimes, this may happen seamlessly without intervention, but it is best to know what to do when that is not how things proceed.
Related Reading
Removing query strings from any URL on an Nginx-powered website
My public transport website is produced using Hugo and is hosted on a web server with Nginx. Usually, I use Apache, so this is an exception. When Google highlighted some duplication caused by unneeded query strings, I set to work. However, doing anything with URLs like redirection cannot be done with a .htaccess file or mod_rewrite on Nginx. Thus, such clauses have to go somewhere else and take a different form.
In my case, the configuration file to edit is /etc/nginx/sites-available/default because that was what was enabled. Once I had that open, I needed to find the following block:
location / {
    # First attempt to serve request as file, then
    # as directory, then fall back to displaying a 404.
    try_files $uri $uri/ =404;
}
Because I have one section for port 80 and another for port 443, there were two locations that I needed to update due to duplication, though I may have got away without altering the second of these. After adding the redirection clause, the block became:
location / {
    # First attempt to serve request as file, then
    # as directory, then fall back to displaying a 404.
    try_files $uri $uri/ =404;

    # Remove query strings only when necessary
    if ($args) {
        rewrite ^(.*)$ $1? permanent;
    }
}
The result of the addition is a permanent (301) redirection whenever arguments are passed in a query string. The $1? portion is the rewritten URL without a query string, built from what the initial ^(.*)$ portion captured. In other words, it redirects from the original address to a new one containing only the part preceding the question mark.
Handily, Nginx allows you to test your updated configuration using the following command:
sudo nginx -t
That helped me with some debugging. Once all was in order, I needed to reload the service by issuing this command:
sudo systemctl reload nginx
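To confirm that the rule behaves as intended, requesting an address with a query string should now return a 301 pointing at the clean address (the URL is just an example):
# Expect a 301 response with a Location header that drops the query string
curl -I "https://www.example.com/page/?utm_source=test"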
With Apache, there is no need to restart the service after updating the .htaccess file, which adds some convenience. The different file locations also mean taking some care with backups when upgrading the operating system or moving from one server to another. Apart from that, all works well, proving that there can be different ways to complete the same task.
A quick look at the 7G Firewall
There is a simple principle to the 7G Firewall from Perishable Press: it is a set of mod_rewrite rules for the Apache web server that can be added to a .htaccess file, and there is a version for the Nginx web server as well. These check query strings, request Uniform Resource Identifiers (URIs), user agents, remote hosts, HTTP referrers and request methods for any anomalies, and block requests that appear dubious.
Unfortunately, I found that the rules heavily slowed down a website with which I tried them, so I am going to have to wait until that is moved to a faster system before I really can give them a go. This can be a problem with security tools, as I also found with adding a modsec jail to a Fail2Ban instance. As it happens, both sets of observations were made using the GTMetrix tool, making it appear that there is a trade-off between security and speed that needs to be assessed before adding anything to block unwanted web visitors.
Moving a website from shared hosting to a virtual private server
This year has seen some optimisation being applied to my web presences, guided by the results of GTMetrix scans. It was then that I realised how slow things were, so server loads were reduced. Anything that slowed response times, such as WordPress plugins, got removed. Usage of Matomo also was curtailed in favour of Google Analytics, while HTML, CSS and JS minification followed. Something that had yet to happen was a search for a faster server. Now, another website has been moved onto a virtual private server (VPS) to see how that would go.
Speed was not the only consideration, since security was a factor too. After all, a VPS is more locked away from other users than a folder on a shared server. There also is the added sense of control, so Let's Encrypt SSL certificates can be added using the Electronic Frontier Foundation's Certbot. That avoids the expense of using an SSL certificate provided through my shared hosting provider, and a successful transition for my travel website may mean that this one undergoes the same move.
For the VPS, I chose Ubuntu 18.04 as its operating system, and it came with the LAMP stack already in place. Because I have offline development websites, the mix of Apache, MySQL and PHP is more familiar to me than anything using Nginx or Python. It also means that .htaccess files become more useful than they were on my previous Nginx-based platform. Having full access to the operating system using SSH helps too and should mean that I have fewer calls on technical support since I can do more for myself. Any extra tinkering should not affect others either, since this type of setup is well known to me and having an offline counterpart means that anything riskier is tried there beforehand.
Naturally, there were niggles to overcome with the move. The first to fix was to make the MySQL instance accept calls from outside the server so that I could migrate data there from elsewhere, and I even got my shared hosting setup to start using the new database to see what performance boost it might give. To make all this happen, I first found the location of the relevant my.cnf configuration file using the following command:
find / -name my.cnf
Once I had the right file, I commented out the following line that it contained and restarted the database service afterwards with a second command, which stopped the appearance of any error 111 (connection refused) messages:
bind-address = 127.0.0.1
service mysql restart
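As a quick check that the change took effect (the hostname and user below are placeholders), a connection attempt from another machine should now succeed rather than being refused:
# Run a trivial query against the VPS from a remote machine
mysql -h vps.example.com -u remoteuser -p -e "SELECT 1;"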
After that, things worked as required and I moved onto another matter: uploading the requisite files. That meant installing an FTP server, so I chose proftpd since I knew that well from previous tinkering. Once that was in place, file transfer commenced.
When that was done, I could do some testing to see if I had an active web server that loaded the website. Along the way, I also instated some Apache modules like mod-rewrite using the a2enmod command, restarting Apache each time I enabled another module.
Then, I discovered that Textpattern needed php7.2-xml installed, so the following command was executed to do this:
apt install php7.2-xml
Next, the following line was uncommented in the correct php.ini configuration file, which I found using the same method as described already for my.cnf, and that was followed by yet another Apache restart:
extension=php_xmlrpc.dll
Addressing the above issues yielded enough success for me to change the IP address in my Cloudflare dashboard so it pointed at the VPS and not the shared server. The changeover happened seamlessly without having to await DNS updates as once would have been the case. It had the added advantage of making both WordPress and Textpattern work fully.
With everything working to my satisfaction, I then followed the instructions on Certbot to set up my new Let's Encrypt SSL certificate. Aside from a tweak to a configuration file and another Apache restart, the process was more automated than I had expected, so I was ready to embark on some fine-tuning to embed the new security arrangements. That meant updating .htaccess files and Textpattern has its own, so the following addition was needed there:
RewriteCond %{HTTPS} !=on
RewriteRule ^ https://%{HTTP_HOST}%{REQUEST_URI} [R=301,L]
This complemented what was already in the main .htaccess file, and WordPress allows you to include http(s) in the address it uses, so that was another task completed. The general .htaccess only needed the following lines to be added:
RewriteCond %{SERVER_PORT} 80
RewriteRule ^(.*)$ https://www.assortedexplorations.com/$1 [R,L]
What all these achieve is to redirect insecure connections to secure ones for every visitor to the website. After that, internal hyperlinks without https needed updating along with any forms so that a padlock sign could be shown for all pages.
With the main work completed, it was time to sort out a lingering niggle regarding the appearance of an FTP login page every time a WordPress installation or update was requested. The main solution was to make the web server account the owner of the files and directories, but the following line was added to wp-config.php as part of the fix even if it probably is not necessary:
define('FS_METHOD', 'direct');
There also was the non-operation of WP Cron and that was addressed using WP-CLI and a script from Bjorn Johansen. To make double sure of its effectiveness, the following was added to wp-config.php to turn off the usual WP-Cron behaviour:
define('DISABLE_WP_CRON', true);
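With WP-Cron disabled, scheduled tasks need to be triggered another way. One common approach, though not necessarily the script mentioned above, is a system cron entry that runs any due events through WP-CLI every few minutes; the WordPress path is an assumption:
# Run due WordPress cron events every five minutes via WP-CLI
*/5 * * * * cd /var/www/html && wp cron event run --due-now >/dev/null 2>&1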
Intriguingly, WP-CLI offers a long list of possible commands that are worth investigating. A few have been examined, but more await attention.
Before those, I still need to get my new VPS to send emails. So far, sendmail has been installed, the hostname changed from localhost and the server restarted. More investigations are needed, but what I have now is faster than what was there before, so the effort has been rewarded already.
The wonders of mod_rewrite
When I wrote about tidying dynamic URLs a little while back, I had no inkling that there would be a second part to the tale. Then came my discovery of mod_rewrite, an Apache module that facilitates URL translation. The effect is that one URL is presented to the user in the browser address bar, and the very same URL is also seen by search engines, while another is passed to the server for processing. Though it might sound like subterfuge, it works very well once you manage to get it set up properly. While the web host for my hillwalking blog/photo gallery has everything configured such that it is ready to go, the same did not apply to the offline Apache 2.2.x server that I have going on my own Windows XP box. There were two parts to getting it working there:
- Activating mod_rewrite on the server: this is as easy as uncommenting a line in the httpd.conf file for the site (the line in question is: LoadModule rewrite_module modules/mod_rewrite.so).
- Ensuring that the .htaccess file in the root of the web server directory is active. You need to set the values of the AllowOverride directives for the server root and CGI directories to All so that .htaccess is active. Not doing it for the latter will result in an error beginning with the following: "Options FollowSymLinks or SymLinksIfOwnerMatch is off which implies that RewriteRule directive is forbidden". Having Allow from All set for the required directories is another option to consider when you see errors like that.
Once you have got the above sorted, add this line to .htaccess: RewriteEngine On. Preceding it with an Options directive to ensure that FollowSymLinks and SymLinksIfOwnerMatch are switched on does no harm at all and may even be needed to get things running. That done, you can set about putting mod_rewrite to work with lines like this:
RewriteRule ^pages/(.*)/?$ pages.php?query=$1
The effect of this is to take http://www.website.com/pages/input and convert it into a form for action by the server; in this case, that is http://www.website.com/pages.php?query=input. Anything contained within a bracketed section is captured and assigned to a numbered variable. If you have several bracketed sections, they are assigned to sequentially numbered variables: $1 for the first, $2 for the second and so on. It's all good stuff when you get it going; not only does it make things look much neater, but it also possesses an advantage when it comes to future-proofing. Web addresses can be kept constant over time, even if things change behind the scenes. It means that any returning visitors will find what they saw the last time that they visited, which surely must ensure good karma in the eyes of those all-important search engines.
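To illustrate the sequential numbering with a made-up example (the script name and parameters are purely hypothetical), a rule with two bracketed sections would populate $1 and $2 in order:
# First capture becomes $1 (the article number), second becomes $2 (the slug)
RewriteRule ^articles/([0-9]+)/([a-z-]+)/?$ article.php?id=$1&slug=$2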