Paper Pitfalls: Lessons From Reviewing

In reflecting on the last year of reviewing, I noticed that lots of papers getting rejected for the same small set of avoidable reasons. I wanted to share some thoughts on how to avoid being caught in these common pitfalls. My hope is that these thoughts are specific enough to be actionable and helpful to students. These thoughts are inspired by the applied cryptography community, but hopefully parts are applicable to other communities.

I made the dubious decision to join the program committees of 5 conferences over the last year and a half: TCC22 and the applied crypto tracks of USENIX23, Oakland23, CCS22, and CCS23. I came away from this process actually somewhat encouraged by the state of peer review in the field—an outcome I didn’t anticipate. Admittedly I observed some late or low effort reviews, but most reviewers with whom I interacted put thought and effort into their reviews and stayed engaged in the review process.

In the spirit of transparency, here's some quick stats from my year and a half of reviewing: I was assigned 63 papers—a number I was able to complete thanks to the help of many subreviewers. Of those papers, 50 were rejected (18 first round rejections, 24 second round rejects, 2 rejections after major revision, and 6 “regular” rejections at TCC), 5 accepted after revision, and 8 were accepted outright. Looking at average decision scores, my submitted reviews at security conferences were slightly more critical than the average reviewer and my reviews at TCC were slightly less critical.

The TLDR: my experience was that many of the papers I was assigned ended up being rejected for a relatively small set of predictable reasons: not being written in a way that was easy to review [Jump To Section] or providing an unsatisfactory evaluation section [Jump To Section]. Also the rebuttal system seems a little bit broken [Jump To Section].

Papers are Exercises in Communication

This might seem like common sense, but appears to be necessary to reiterate: the goal of writing a paper should be to further the community’s understanding of the field. Notably, this is different from convincing the reader that you are smart or demonstrating that your work is non-trivial. This means that writing quality is an incredibly important part of the research process. While this should be “obvious,” I would say that presentation problems were the most common form of complaint that I saw among reviews (including my own).

The reader should learn something they didn’t know before: When I read a paper, I want to come away having learned something new. Maybe it's a new way to compose cryptographic primitives. Maybe it's that an approach I would have assumed would have failed. Maybe it's a deeper understanding of a particular application area that motivates the need for a new cryptographic approach. I don’t particularly care what I learn, but I’m uncompromising on the fact that papers should teach the readers something.

A good example of a paper that doesn’t teach anything is “we took an existing application and ran it within an MPC.” As a reader, I already know that MPC can evaluate arbitrary functions, and I can estimate the runtime of evaluating a particular functionality with a simple gate count. As a reader, I feel as though I could have sketched the results in about an hour of brainstorming. Another example is “we shift the prior MPC-powered application from the dishonest majority setting to the honest majority setting, and observe that the result is faster.” This just repeats what well-informed MPC readers already know: I expect honest majority MPC to be faster!

A good metric for checking if a paper accomplishes this goal is making sure that a well-informed reader cannot guess the entire technical contents of the paper by reading the description of the problem setting. If they can, it probably means that the paper doesn’t delve deeply enough.

Deal with contributions and prior work: A ton of rejections that I saw came down to reviewer confusion about contributions relative to prior work. It should be crystal clear what’s new in your work, in particular what ideas are new in your work. Clearly list all your contributions, highlight the ways in which your work uses new tricks or concepts that prior work is missing, and compare to prior work. There is no faster way to a rejection than a reviewer remembering a paper they read recently on a similar topic and either being confused about the conceptual delta between the two papers or finding it uncited. Writing a good related work section which explains the ideas present in prior work (ie. more than just block citations with a result statement) is strictly necessary and shouldn’t be hidden at the end of the paper.

Complexity reduces acceptance probability: There appears to be a misconception that complexity is a feasible path to demonstrating non-triviality. Unfortunately, intentionally making a papering look non-trivial in this way just results in confused reviewers. It is much easier to reject a paper because it is confusing than to reject a paper because it lacks novelty. Rejecting a paper on novelty grounds will usually spark a real conversation among the reviewers and it is common for reviewers to come to the defense of simple ideas, arguing that the best ideas are simple. In the end, complexity only reduces the chance that a paper ends up in the accept pile. You are better off communicating that your ideas are simple, but powerful.

On rare occasions, papers are unavoidably complex (of the 63 papers I saw, there were maybe 2 that fell into this category). If this is the case, it’s extra important to highlight a handful of key ideas that guided the work. In truth, reviewers are likely to only understand those key ideas, so make them simple and demonstrate their power.

Failure is an important part of the narrative: Good research should meaningfully explore the design space. A good research paper should demonstrate that the researchers did the due-diligence in exploring the design space and convince the reader that the particular solution the authors chose to describe was the result of careful consideration (rather than simply writing down the first idea the authors imagined). Even if the “obvious” solution is the one that the authors choose to describe, it's critically important to include a discussion of why other potential solutions failed.

Part of this is including the failed approaches you encountered along the way. Including these failures serve multiple purposes: (1) they help ensure that other researchers don’t have to go down the same dead-ends, and (2) demonstrate that the chosen solution isn’t arbitrary.

Acronyms are not your friends: Applied cryptographers appear to be addicted to acronyms. I’m not so concerned about the “standard” acronyms like ABE, IBE, PKE, MPC, 2PC, FHE, ZKP, POK, NIZK, SNARK, SNARG, VDF, PRF, PRG, IOP, PIR, PSI, CRHF, OWF, AEAD, AES, GCM, SHA, TEE, TCB, ECDSA, BLS, IND-CPA, IND-CCA, IND$-CPA, UF-CMA, MAC, DDH, DL, LWE, LPN, CIDH, XDH, DLIN, PQC, MPCitH, 2FA, TOTP, FIDO, TLS, IPSEC, DNSSEC… But, like, can we all agree that this is already a pretty healthy number of acronyms to remember? And maybe not every paper needs to introduce a new acronym to describe the kind of system that they are designing, or a set of the acronyms to refer to the various subcomponents of the proposed system? Adding new acronyms only increases the cognitive load required to read the paper. Moreover, the community probably won’t need a shorthand to refer to the primitive you have designed. On the off chance that they do, you will look so much cooler and casual if you let the community come up with the right acronym instead of trying to push it yourself.

As a rule, you get *maximum* one new acronym before I get annoyed at your paper as a reviewer. And that one better either replace a 10 syllable phrase or be for something that you personally plan on reusing for your next paper.

Pick the Right Evaluation Metrics

Implementation and evaluation sections are simply necessary at security conferences, and a bad implementation and evaluation section can sink an otherwise strong submission. There are two ways for this to happen: (1) the evaluation shows that the proposed solutions are highly inefficient or less efficient than prior work, or (2) the evaluation metrics themselves are lacking, making it impossible to glean meaningful information from the implementation section. I want to focus on this second failure mode, as I rarely see the first one—and when I do, authors tend to already be aware of the problem.

The choice of evaluation metric has to match the story you are telling about your system. That is, it should convince the reader that your proposed system takes significant steps towards deployability for the full system. Evaluations that don’t do this will inevitably miss the mark, and make readers skeptical.

Here are some pitfalls I saw often:

“Our system can evaluate X events in Y seconds:” as a reviewer, I might be an expert in designing cryptographic systems, but I likely have literally no idea about the concrete requirements of a deployed system. Is X the number of queries that similar systems expect in a second? A day? A month? A year? Is Y seconds a lot? Would Y interfere with usability to an extent that it would be a deal breaker? How much money would it cost a company to run the system for those Y seconds? It’s fine if the answer to these questions reveal that the proposed system isn’t ready for deployment tomorrow, but a reader should at least be able to evaluate the implementation within the target context. If the proposal is off by a factor of 100x from deployability, then maybe we can simply optimize our way to a real system. If we are off by a factor of 10000x, maybe it's not quite time to write the systems paper yet and the community should focus on building more efficient versions of the underlying cryptographic primitives
“We provide microbenchmarks:” I get it—I’m a researcher too. Microbenchmarks are infinitely easier to gather than deploying a full system, and I’m guilty of this one. Unfortunately, microbenchmarks often hide costs associated with scaling-up, and make the reader skeptical that the authors have actually considered all the costs associated with a full deployment. In my opinion, microbenchmarks should only be used to complement a full evaluation, highlighting where there are opportunities for continued research and optimization. There are cases in which doing a full evaluation will clearly not be more illuminating than just giving microbenchmarks; in these cases, authors should clearly justify why this is the case.
“We evaluate the same way as prior work:” prior work can be wrong—even if it won an award at a top tier conference. It’s important to compare your work to prior work, but it’s equally important to identify the evaluation metric that demonstrates the intellectual contributions that your paper is making to the field. In fact, it’s a fantastic contribution to say “prior work did an insufficient job evaluating their proposals.”
The subcomponent optimization trap: I saw several papers that were optimizing a subprotocol that is a component of larger, existing cryptographic proposals. Think multiplication subprotocols for MPC (this is not a real example, just an illustration). The motivation for the paper spent most of its effort talking about the importance of improving the higher-level systems, but would then go on to evaluate just the optimized subprotocol. This leads to a natural mismatch between the motivation and the evaluation, leaving the reviewers asking “well, did you accomplish your goal?” If you want to maximize the impact of your work, make sure that the evaluation covers the entire stated goal; if you want to get maximum publications for minimum work, make sure your motivation is cleanly scoped so that reviewers expect less.

A Personal Opinion About Rebuttals

I want to close by sharing a thought about the rebuttal process. I’ve had many informal conversations about the futility of posting a rebuttal. Very few of the rebuttals that I saw this year meaningfully changed the minds of reviewers, despite it being clear that the authors put significant effort into writing their rebuttals. To be clear, I don’t think this is reviewer’s faults; it’s a systemic misunderstanding of how rebuttals *should* be used.

The norm should be to not post a rebuttal: rebuttals are an opportunity to clear up real misunderstandings that are demonstrated by the reviews. However, I found that most rebuttals want to get into minor confusions in the reviews, rather than deal with the high-level concerns. The reality is that fixing minor confusions will likely not meaningfully change the decision outcome. As such, it only really worthwhile to write a rebuttal if the reviewers are really missing the point, which is a rarity.

I don’t actually expect it to become normal to not post a rebuttal. If anything, not positing a rebuttal might be seen as a lack of investment in the process, and could even hurt the chances of a borderline paper. But, I think we could collectively save thousands of wasted hours if we could shift this norm.

A good rebuttal should propose concrete revision criteria: I think the best use of a rebuttal is to propose a concrete list of revision timeline and criteria. For example, “we propose to make the following changes and believe we can do so within the scope of a minor revision.” After rebuttals are in, borderline papers need an advocate who is willing to compose a set of revision criteria—a task that few reviewers seem eager to do. Authors can reduce the onus of putting together these criteria by proposing a concrete path towards acceptance, which will make the reviewer conversation more efficient and productive. In particular, it might mean that an advocate might not actually be necessary, as no one on the reviewer side has to go back through the reviews to compile this list. Instead, a set of tired reviewers can all just sign off on the proposed plan—a very low effort task.