User:Ottomata/Organized Code

This is a collection of my preferences and best practices to keep code and systems organized and comprehensible, especially for newcomers.

Wherever folks like them or agree, perhaps we could move these into shared team best practices somewhere?

Let me know what you think! Please add comments on the talk page or ping me in Slack.

Naming

When coding, naming is very important. The names you use will stick around for decades, and be used to refer to concepts both casually and concretely. To ease understanding and avoid confusion, programmers need to be very careful when naming things. We need to be intentional about naming, for small choices like variable names all the way to big decisions like project and platform names.

Capital letters in data keys are a really bad idea OR snake_case > camelCase

Data moves around. It will be used in different languages with different typing and different naming rules. It will certainly be used in SQL based systems, which are for the most part case insensitive. HTTP header names are also case insensitive, as are the scheme and host parts of URIs. The only common identifier naming rule that will function in all of these systems is snake_case (as opposed to camelCase).

Any time data passes through a case insensitive system, it will be normalized, most likely to all lower case. Names like isPartOf and mainEntity will become ispartof and mainentity. Longer names that include acronyms get even worse. In camelCase, it isn't clear what the acronym capitalization rules are. E.g. HTTPURLID? HttpUrlId? Whatever the camelCase acronym rule is, the name will be normalized in SQL systems to e.g. httpurlid. Data integration automation code has to reason about which fields are the same. If ingesting data that has capital letters, it is possible that two different fields end up normalized to the same lower cased name. Then we just have to guess about how to ingest data.

Every time someone needs to move camelCased data identifiers in case insensitive systems, they will have to write code that reasons about the case changes. If we avoid upper cased field names in our schemas, we are less likely to encounter bugs and breakages in data pipelines.
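The collision risk can be made concrete with a small sketch. The field names and the find_case_collisions helper are hypothetical, chosen only for illustration, and lower casing stands in for whatever normalization a case insensitive system applies.

```python
from collections import defaultdict


def find_case_collisions(field_names):
    """Group field names that become identical once lower cased.

    Returns a dict mapping each colliding lower cased name to the
    original names that normalize to it.
    """
    normalized_to_originals = defaultdict(list)
    for field_name in field_names:
        normalized_to_originals[field_name.lower()].append(field_name)
    return {
        normalized: originals
        for normalized, originals in normalized_to_originals.items()
        if len(originals) > 1
    }


# Hypothetical field names, as they might appear in one schema.
camel_case_fields = ["httpUrlId", "HTTPURLID", "isPartOf", "mainEntity"]
snake_case_fields = ["http_url_id", "is_part_of", "main_entity"]

camel_case_collisions = find_case_collisions(camel_case_fields)
snake_case_collisions = find_case_collisions(snake_case_fields)
print(camel_case_collisions)
print(snake_case_collisions)
```

The snake_case names are already lower case, so normalization leaves them unchanged and no two of them can collapse into one.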

Additionally, I've heard that camelCase can be difficult for non-native English speakers. incomingHTTPRequestIpAddress (which is normalized to incominghttprequestipaddress) is (subjectively) more difficult to read than incoming_http_request_ip_address.

Be specific

Be specific and include context in names, especially for higher level product / project / interface names.

Imagine you are a newcomer to your project. A newcomer should be able to semi-quickly understand the purpose and context of your project by its name.

Examples:

"Config store" is a bad name. "Datasets config" is better.
"Dumps" is a bad name. "mediawiki content dump files" is better.

Avoid casual names

As humans, we often use imprecise and casual names. Allowing those names to creep into code and systems can make things confusing to newcomers who aren't associated with the people who implemented the code.

Do not use 'platform' in code or systems

A platform is a human imposed grouping of systems and conventions that work together to do a thing. Naming individual technologies or systems with the word 'platform' codifies the higher level grouping that inspired the creation of those systems into those systems themselves.

Systems and code should be agnostic of the platform they live in.

Avoid using team names in code or systems.

Team names change more often than systems. Naming projects, code, repository groupings, deployment instances, clusters, etc. after a team name might seem like the right thing to do now, but it will almost always be wrong later.

Differentiate between 'team' names and 'functional' names. Teams are often named after functions, e.g. 'Data Engineering' team, 'Research' team. Naming things functionally is good! You can still name a grouping of git repositories or a deployed cluster 'research', but that should be done because the thing is for research, not for the Research team.

When you must use team names, always suffix them with 'Team' to avoid confusion between functional concepts and team names.

Nouns before adjectives

English (which most code is written in) usually puts adjectives before nouns, e.g. blue sky, fast horse, default setting, etc.

In code, we often want to group and sort concepts together. Adjectives modify concepts / nouns. To make it easier to group and sort, we should prefer naming things with nouns before adjectives.

Examples:

feature_enabled is better than enabled_feature

feature_setting_default is better than default_feature_setting

feature_enabled_default is better than default_enabled_feature
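A quick way to see the grouping benefit is to sort a list of names, as an IDE, a config file, or a docs index might. The setting names below are hypothetical and only serve as an illustration.

```python
# Hypothetical setting names for two features, written both ways.
noun_first_names = [
    "search_enabled",
    "upload_enabled",
    "search_timeout_default",
    "upload_timeout_default",
]
adjective_first_names = [
    "enabled_search",
    "enabled_upload",
    "default_search_timeout",
    "default_upload_timeout",
]

# Noun first: everything about one feature ends up adjacent.
sorted_noun_first = sorted(noun_first_names)
# Adjective first: names are grouped by modifier, features are scattered.
sorted_adjective_first = sorted(adjective_first_names)

print(sorted_noun_first)
print(sorted_adjective_first)
```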

Booleans should have positive names

Generally, I prefer boolean fields and functions to represent the positive of a condition, rather than a negative. Representing a negative state forces readers to reason about double negatives.

Quickly reading and understanding

if (is_feature_enabled)

is easier than:

if (!is_feature_disabled)

And it gets worse when you have to compare multiple things that don't use the same convention!

if (!is_feature_a_disabled && is_feature_b_enabled)

Singular vs plural variables

I prefer to use singular for most things, unless there is a reason to use plural. I like to use plural to name collections of things. E.g. an instance of a Revision I might call $revision, but a list of Revisions might be $revisions.

Maps / dictionary variables

Generally, variable names should be descriptive. Maps or dictionary names should try to describe their keys and values. I like to name maps like 'key to value'. E.g. page_id_to_page or page_by_page_id.

This can get verbose as structures get complex or nested, so this requires some judgement calls and can be a little subjective.
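Here is a minimal sketch of the 'key to value' convention. The Page class and its fields are assumptions made up for this example, not a real API.

```python
from dataclasses import dataclass


@dataclass
class Page:
    # Hypothetical minimal page record, for illustration only.
    page_id: int
    title: str


# Plural name for a collection of pages.
pages = [Page(1, "Main Page"), Page(2, "Sandbox")]

# The name states both the key (page_id) and the value (page).
page_id_to_page = {page.page_id: page for page in pages}

# The value is itself a collection, so that part of the name is plural.
title_to_page_ids = {}
for page in pages:
    title_to_page_ids.setdefault(page.title, []).append(page.page_id)

print(page_id_to_page[2].title)
print(title_to_page_ids)
```

A reader who meets page_id_to_page[2] far from where the map was built can still tell what goes in and what comes out.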

Verbose > abbreviations

Abbreviations are sometimes necessary. However, if not necessary, we should avoid them.

E.g. transfer_destination is better than xfer_dest.

Consistent naming

We often have to re-use original naming concepts in new places. E.g. a config variable, a function parameter passed through many functions, a constructor parameter and a class instance property, an accessor method name, etc.

We should strive to keep variable names as consistent as possible, all the way from CLI options and form input variables down to database column names. The names of these variables won't be exactly the same, but it should be easy for a newcomer to the codebase to understand how they are associated.
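As a rough sketch of what this can look like, here is one name carried from a CLI option through a constructor parameter to an instance property. The FileTransfer class and build_file_transfer function are hypothetical and exist only for this example.

```python
import argparse


class FileTransfer:
    """Holds the settings for one file transfer (the copy itself is omitted)."""

    def __init__(self, transfer_destination):
        # The property name matches the constructor parameter.
        self.transfer_destination = transfer_destination


def build_file_transfer(argv):
    """Build a FileTransfer from a list of CLI arguments."""
    parser = argparse.ArgumentParser()
    # The CLI option name matches the parameter and property names.
    parser.add_argument("--transfer_destination", required=True)
    args = parser.parse_args(argv)
    return FileTransfer(transfer_destination=args.transfer_destination)


file_transfer = build_file_transfer(["--transfer_destination", "/tmp/out"])
print(file_transfer.transfer_destination)
```

Searching the codebase for transfer_destination now finds every layer the value passes through.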

Documentation

Help newcomers

You understand the systems and code you own and have been working on way more than a newcomer to it. You have spent many hours reading e.g. Spark or Flink API docs or Puppet resource type manuals. You understand the difference between a helm chart and a helm file. You know how to use MediaWiki ServiceWiring and dependency injection.

We should use documentation to help newcomers as much as we can, so that they don't have to grind through learning in the way you have.

Always write file, class and function docs

Always write code documentation, even for private methods and interfaces. You don't have to document all parameters (especially when they repeat in the same file), but all files, classes, and functions should describe their purposes.

Public interfaces should provide usage examples.
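One possible shape for this, sketched in Python: a docstring that states the purpose and includes a usage example in doctest format. The function itself is a made-up illustration.

```python
def page_by_page_id(pages):
    """Index pages by their page_id.

    Args:
        pages: iterable of dicts, each with a "page_id" key.

    Returns:
        dict mapping each page_id to its page dict.

    Example:
        >>> page_by_page_id([{"page_id": 7, "title": "Sandbox"}])
        {7: {'page_id': 7, 'title': 'Sandbox'}}
    """
    return {page["page_id"]: page for page in pages}
```

Because the example is written as a doctest, running python -m doctest on the file checks that the documented usage still works.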

Explain complex concepts

Summarize and explain complex concepts you have learned while doing your work.

Example:

You are implementing a new Spark DataFrame Sink. This requires you to understand Spark data types, Sink and Source concepts, DataFrameReaders and Writers, data source framework registration, etc. To learn this, you have read API docs, Spark developer documentation, blog posts, Stack Overflow, and upstream Spark code.

Years from now, some developer other than you will have to make a change to your code, or perhaps implement a new Sink. They will have to learn what you have learned. You can help them save time by including summaries of the things you have learned in your code. Explain why you need a method to convert between DataTypes. Explain what a Spark DataSource registry is. Provide links back to the documentation that helped you learn and understand.

If you do this, you will make it easier for your code and systems to be evolved and improved.

Explain intentions, compromises, and flaws.

We often have to make compromises. Refactoring something might make it easier to build a feature now. Instead we choose to work around the system's limitations to get something done. This is fine. But, when we do this, we should document the workarounds and compromises we've made, and explain the intention for the future.

This will help newcomers understand the compromise, and possibly give them a path to make the systems better. They won't assume that the system limitation or workaround is a fundamental design flaw, and will be able to see why a compromise was made.

This could be done in code as well as in higher level system docs.

Examples:
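A hypothetical in-code example, not taken from any real codebase: the docstring names the compromise, says why it was made, and describes the intended future fix, while a short comment marks the exact spot of the workaround.

```python
import json


def parse_events(json_lines):
    """Parse an iterable of JSON lines into a list of event dicts.

    Known compromise (hypothetical example):
    Malformed lines are skipped instead of being reported. The
    intention is to report them, but this code has no error reporting
    interface yet. Skipping is a workaround, not a design decision:
    once such an interface exists, replace the `continue` below with
    a call to it.
    """
    events = []
    for json_line in json_lines:
        try:
            events.append(json.loads(json_line))
        except json.JSONDecodeError:
            # Workaround, see the docstring: drop the malformed line.
            continue
    return events


events = parse_events(['{"page_id": 1}', "not json", '{"page_id": 2}'])
print(events)
```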

Balancing ease of use vs flexibility via configurable smart defaults

We should strive to make our interfaces (CLI, public methods, HTTP APIs, etc.) easy to use. However, we also want to make them flexible and comprehensive, to enable more use cases and to make them more powerful.

There will always be a tension between simplicity and power. This tension can be navigated with a pattern I'll call 'configurable smart defaults'.

Configurable smart defaults - SECTION STILL WIP

Goals:

Users can control any setting
Users have to specify as few settings as possible

To do this, a system should:

Provide agnostic base system 'smart defaults' (definition below)
Provide a way for operators to configure environment specific smart defaults
Still allow users to control any setting

Smart defaults

'Smart defaults' are default values that can be inferred based on the values of other settings. It is sometimes difficult (but not impossible) to implement smart defaults in templating frameworks (ERb, yaml, jinja, go templates, etc.) that don't give you good/easy programmatic control (loops, variables, dict merging, etc.). Smart defaults are easier to implement in code.
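A minimal sketch of smart defaults in code. The setting names (environment, log_level, log_file) and the apply_smart_defaults function are assumptions invented for this illustration.

```python
def apply_smart_defaults(user_settings):
    """Return a copy of user_settings with unset values filled in.

    Values the user provided are never overwritten.
    """
    settings = dict(user_settings)

    # Plain default: does not depend on any other setting.
    settings.setdefault("environment", "production")

    # Smart defaults: inferred from the value of another setting.
    is_development = settings["environment"] == "development"
    settings.setdefault("log_level", "DEBUG" if is_development else "INFO")
    settings.setdefault(
        "log_file", f"/var/log/my_prog/{settings['environment']}.log"
    )
    return settings


print(apply_smart_defaults({}))
print(apply_smart_defaults({"environment": "development", "log_file": "/tmp/x.log"}))
```

A user who sets only environment gets a matching log_level and log_file, and can still override either one explicitly.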

Provided environment defaults

Where possible, settings should all have base (smart) defaults built into your code. To make it easier for operators to override these base defaults for specific environments, while still allowing end users to control any setting, you might consider adding a setting for provided defaults.

This setting could be added as a wrapper by operators installing your code, or it could be done by your code itself. Here's an example of doing it in your code:

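# CLI: my_prog --defaults_file=env_defaults.yaml

# my_prog.py
import argparse
import yaml  # PyYAML, assumed to be installed

parser = argparse.ArgumentParser()
parser.add_argument('--defaults_file')
# ...declare the other settings here, each with default=None...

def parse_cli_args(argv=None):
    # Drop unset (None) options so they cannot override provided defaults.
    args = vars(parser.parse_args(argv))
    args_dict = {k: v for k, v in args.items() if v is not None}

    if args_dict.get('defaults_file'):
        with open(args_dict['defaults_file']) as defaults_file:
            provided_defaults = yaml.safe_load(defaults_file) or {}
        # User provided settings win. This merge is shallow; nested
        # settings would need a deep merge here.
        args_dict = {**provided_defaults, **args_dict}

    # args_dict now contains any user provided settings,
    # as well as defaults provided in defaults_file
    return args_dict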