Evolution of Thought


This may be the first post I have written that relates directly to something I did for work. It is also an opportunity to show growth in how I approach problems and learn from them for the future.

The situation – The new site for work was created and pushed live. All was well until the day we discovered that the dev and staging sites had been indexed by Google!

Previously, this had not been an issue for two reasons: 1) there was no proper staging site, and 2) the dev site was on a completely different platform, behind a firewall that used IP filtering. I recognize that this was not the best way to develop, but that is a conversation for another post. The new site had more involvement from the company site admin, and we set up the dev and staging environments on his servers, alongside production.

Back to the Google indexing… I believe it was our SEO contractor who notified me about the situation. As a short-term solution, I set up the dev and staging sites to return a “410 Gone” status, so Google would eventually remove them from its index on future crawls. This turned out to be relatively simple. As the site is built with Flask, I used an @app.after_request function, which runs after every request, to set the status code on each response.

Oh yeah, I almost forgot to mention the IP validation that was required. I had to validate IPs because when the site returned a 410 status, it did so for embedded files like CSS and JS as well, so as far as the browser was concerned they essentially “did not exist”. I created an isolated URL that captures the user’s IP address and stores it in a JSON file. If the user’s IP address is in this file, the site can be viewed normally. An example of the code is below.

import json
import os

from flask import request

# `app` is the existing Flask application object
validated_ip = '/path/to/file/valid_ip.json'

@app.after_request
def set_headers(response):
    # returns 410 status code to help get dev and staging out of google index
    if app.config.get("SITE_ROOT") != "https://www.prod_site.com":
        validated = []
        # the file does not exist until the first visit to /user-ip-validate
        if os.path.exists(validated_ip):
            with open(validated_ip, "r") as file:
                validated = json.load(file)["validated"]
        if request.remote_addr not in validated:
            response.status_code = 410
    return response

@app.route("/user-ip-validate")
def validate_ip():
    if not os.path.exists(validated_ip):
        # first visitor: create the file with this address in it
        data = {"validated": [request.remote_addr]}
        with open(validated_ip, "w") as init_file:
            json.dump(data, init_file)
        return "your ip address has been added"
    else:
        with open(validated_ip, "r") as file:
            data = json.load(file)
        # only write when the address is new, so repeat visits do not add duplicates
        if request.remote_addr not in data["validated"]:
            data["validated"].append(request.remote_addr)
            with open(validated_ip, 'w') as write_file:
                json.dump(data, write_file)
        return "your ip address is on the list"
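
The allowlist logic can also be checked without running the site at all. What follows is a rough sketch rather than the code that ran on the site: the helper names load_validated, add_ip, and status_for are made up for illustration, the IP address is a documentation-range placeholder, and a throwaway temp file stands in for valid_ip.json.

```python
import json
import os
import tempfile


def load_validated(path):
    """Return the list of validated IPs, or an empty list if the file is missing."""
    if not os.path.exists(path):
        return []
    with open(path, "r") as file:
        return json.load(file)["validated"]


def add_ip(path, ip):
    """Add ip to the JSON file unless it is already there; return True if it was added."""
    validated = load_validated(path)
    if ip in validated:
        return False
    validated.append(ip)
    with open(path, "w") as file:
        json.dump({"validated": validated}, file)
    return True


def status_for(path, ip, original_status=200):
    """Validated visitors keep the original status; everyone else gets 410."""
    return original_status if ip in load_validated(path) else 410


# Throwaway file standing in for valid_ip.json (hypothetical location)
demo_path = os.path.join(tempfile.mkdtemp(), "valid_ip.json")
before = status_for(demo_path, "203.0.113.7")  # file does not exist yet
added = add_ip(demo_path, "203.0.113.7")
after = status_for(demo_path, "203.0.113.7")
print(before, added, after)  # prints: 410 True 200
```

Keeping the file handling in plain functions like this makes it easy to confirm the two edge cases (no file yet, and the same visitor coming back) before wiring anything into Flask.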

Once all but a few (maybe 6) of the offending pages had been removed from the index, we moved to a login process. No one would be able to access dev or staging without proper credentials, and there would be NO scraping!!! (I still have no idea how dev and staging were discovered…) This very effective safeguard was achieved with a few lines of code, using flask_httpauth in an @app.before_request function:

# requires flask_httpauth; `auth` is an auth object (for example HTTPBasicAuth) set up elsewhere
@app.before_request
def login_for_dev_and_stage():
    # if not public prod site, must have login credentials
    if app.config.get("SITE_ROOT") != "https://www.prod_site.com":
        return auth.login_required(lambda: None)()

The biggest thing I learned is something I already knew but had not put into practice: never assume. Since the admin had taken a more active role, I assumed that he would take care of protecting the sites, and I was wrong. I am not a person who is afraid to ask questions; I just didn’t. A few seconds of clarification can save a LOT of time and effort down the road.