Progress & Roadmap*

Popular AI-related headlines and news can create the impression that AI will automate all jobs or take over the world within a few years, posing an existential threat to humanity (see slides 14 and 20 here). Often these popular press headlines refer to competitions in which, eventually, all occupations lose to machines (see, for example, AI vs Doctors). Regardless of the catchy headlines, AI progress is not as fast as one might think, and many technological and feasibility limitations are slowing it down (see Figure 1 below). Some even say Moore’s Law has run out of steam, and that exponential change will stop before petascale and exascale AI systems can become widespread and commercially feasible. However, if Moore’s Law does keep pace, then the widespread availability of AI-produced “digital workers” seems likely, eventually affordable by everyone and boosting GDP per employee for nations to higher and higher levels (see Figure 2 below).


Figure 1: Projected decrease in the cost of computing power for human-level AI (Source).


Figure 2: Projected increase in GDP per employee for the USA (Source).

Among the main limitations on AI progress are the availability and cost of computing power, and the energy it requires with current computing architectures. For example, it has been estimated that computing power matching the human brain, if implemented with today’s supercomputer technology, would draw 12 GW of power (the output of more than seven nuclear power plants). A good expert study on AI progress and impact has been published by Stanford University, giving a more realistic picture of the progress of AI and reassuringly concluding that AI is not an imminent threat to humankind and can have a positive impact on our society and economy between now and 2030.
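To put that power estimate in perspective, a back-of-envelope sketch (the per-plant output below is an assumed round figure for a large plant, not taken from the study):

```python
# Back-of-envelope check of the "12 GW -> over 7 nuclear power plants" claim.
BRAIN_SCALE_POWER_GW = 12.0  # estimated draw of a brain-scale supercomputer
PLANT_OUTPUT_GW = 1.6        # assumed output of one large nuclear plant

plants_needed = BRAIN_SCALE_POWER_GW / PLANT_OUTPUT_GW
print(f"{plants_needed:.1f} plants")  # 7.5 plants, i.e. "over 7"
```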

Artificial intelligence is researched, developed and applied in a wide variety of domains, each with its own distinct requirements. Despite this heterogeneity, some basic AI enablers, and progress on them, can be monitored via tests, challenges and benchmarks. This site collects resources (projects, leaderboards, metrics, challenges and competitions) that can be useful for measuring AI progress on tasks that can be seen as enablers for a variety of AI applications in different domains.


Figure 3: AI Progress on Open Leaderboards – Benchmark Roadmap (see Slide 8).

Caveat: Not all competitions and leaderboards are created equal. The “best” leaderboards are highly automated and have entries from all the top industrial, academic and government research labs, battling it out continuously. In some areas, however, the competitions are still accumulating enough labeled data for training and testing, and may run just once a year at an annual conference. We expect competitions and their leaderboards to evolve, perhaps including meta-level competitions and leaderboards for the “one system” that can enter the most competitions and score well across many domains (see, for example, the I-athalon proposal and other new testing ideas).


General Artificial Intelligence:

  • General AI Challenge is a challenge, with cash prizes, on the development of general AI models. The challenge is organised in rounds; the first round was dedicated to gradual learning and will be repeated, as the problem was not solved in the first round.

Text understanding, common sense and semantics:

  • SQuAD is a reading comprehension dataset and test consisting of crowdsourced questions on Wikipedia articles, where the answer to each question is a segment (span) of text from the article read. SQuAD currently includes over 100,000 question–answer pairs on over 500 articles and has a continuously updated leaderboard of the best-performing models.
  • VQA is a dataset, yearly AI challenge and leaderboard containing open-ended questions about images. The challenge requires a combination of visual, language and common-sense understanding, making it a good measure of progress in multi-modal understanding.
  • GuessTwo is a machine comprehension task and benchmark based on comparing paragraphs about semantically close but distinct common entities. Human performance on a selected test set of paragraphs is 94.2%, which can be used as a benchmark for machine comprehension.
  • Story Cloze Test evaluates story understanding using four-sentence stories with two candidate endings, where a system should select the correct ending after “reading” the story. It measures understanding of temporal and causal relations in stories.
  • Winograd Schema Challenge is an alternative to the classic Turing test for AI, in the form of structured multiple-choice questions.
  • Allen AI Science Challenge is a dataset, test and leaderboard for AI models based on a standardized US 8th-grade science exam. The test consists of multiple-choice questions: the training data set includes 2,500 exams answered by 8th graders, the validation data set 8,132 questions of the same type without answers, and the test data set 21,298 questions without answers.
  • SemEval is a series of workshops and evaluations on computational semantics analysis systems.
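To make concrete how leaderboards such as SQuAD’s rank entries, here is a minimal sketch of an exact-match scorer in the style of the SQuAD metric (the normalization steps of lowercasing and stripping punctuation, articles and extra whitespace follow the commonly described recipe; this is an approximation, not the official evaluation script):

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold_answers: list[str]) -> bool:
    """A prediction scores if it matches any of the gold answer spans."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

print(exact_match("The Eiffel Tower", ["eiffel tower", "Eiffel Tower"]))  # True
```

A leaderboard score is then just the fraction of test questions for which `exact_match` (or the softer token-overlap F1) holds.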

Environment understanding:

  • Cityscapes is a dataset, benchmark and leaderboard for AI models on semantic labeling and understanding of urban street scenes, based on videos from the streets of German cities.
  • ShapeNet is an ongoing effort and dataset of annotated 3D objects, aimed mainly at computer graphics, computer vision and robotics research.

Conversational intelligence: 

  • DSTC6 is a set of challenges on intelligent dialog systems including goal-oriented dialog learning, conversation modeling and dialogue breakdown detection.
  • The Alexa Prize is a challenge on developing a voice-interaction-based socialbot, combining elements from multiple fields of conversational AI: knowledge acquisition, natural language understanding, natural language generation, context modeling, commonsense reasoning and dialog planning.
  • NIPS Conversational Intelligence Challenge is a competition and leaderboard for AI models capable of having intelligent conversations with humans about news articles.

Speech/Voice understanding:

  • An article comparing the natural language understanding of voice-based AI assistants.
  • CHiME is a yearly challenge on speech separation and recognition in varying contexts.

Visual understanding: 

  • ImageNet is a database of images organised according to the WordNet hierarchy, so that each word is depicted by a collection of images. ImageNet has also organised yearly challenges on visual machine recognition, with leaderboards, since 2010.
  • VQA is a yearly dataset, challenge and leaderboard for visual question answering.
  • YouTube-8M is a dataset of millions of YouTube videos labelled with thousands of classes. It also has a related challenge and leaderboard for producing the best video tag predictions.
  • UCF11–UCF101 are labelled datasets based on YouTube videos, with a challenge on action understanding from videos.
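The headline number on ImageNet-style leaderboards is typically top-1 or top-5 accuracy: a prediction counts as correct if the true label appears among the model’s k highest-scoring classes. A minimal sketch with made-up class scores:

```python
def top_k_correct(scores: dict[str, float], true_label: str, k: int = 5) -> bool:
    """True if true_label is among the k highest-scoring classes."""
    top_k = sorted(scores, key=scores.get, reverse=True)[:k]
    return true_label in top_k

# Toy class scores for one image (made-up numbers).
scores = {"tabby cat": 0.62, "tiger cat": 0.21, "lynx": 0.09,
          "dog": 0.05, "fox": 0.02, "car": 0.01}
print(top_k_correct(scores, "lynx", k=1))  # False: top-1 is "tabby cat"
print(top_k_correct(scores, "lynx", k=5))  # True
```

Averaging this indicator over a test set gives the top-k accuracy reported on the leaderboard.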



Hardware performance:

  • DeepBench by Baidu is a collection of neural network libraries used to benchmark the performance of basic deep learning operations on different hardware.
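The core of a hardware benchmark of this kind is simply timing a basic operation repeatedly and keeping the best run. A pure-Python sketch of that pattern (DeepBench itself times vendor kernels in native code; the naive multiply here just stands in for the operation under test):

```python
import time
import random

def matmul(a, b):
    """Naive dense matrix multiply (list-of-lists): the op being benchmarked."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def benchmark(op, *args, repeats=5):
    """Return the best wall-clock time over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        op(*args)
        best = min(best, time.perf_counter() - start)
    return best

n = 64
a = [[random.random() for _ in range(n)] for _ in range(n)]
b = [[random.random() for _ in range(n)] for _ in range(n)]
print(f"{n}x{n} matmul, best of 5: {benchmark(matmul, a, b):.4f} s")
```

Taking the best rather than the mean time is a common choice for microbenchmarks, since it suppresses interference from other processes.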


Motor control:

  • NIPS Learning to Run is a challenge and leaderboard for AI models learning to run and navigate a complex route, given a musculoskeletal model of a human and the space to navigate in.
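Entries to challenges of this kind typically drive the simulator through a reset/step loop; the toy environment below is purely illustrative (the real challenge uses a physics-based musculoskeletal simulator, and all names here are made up):

```python
import random

class ToyRunEnv:
    """Stand-in for a running simulator: reward is forward distance covered."""
    def reset(self):
        self.x, self.t = 0.0, 0
        return self.x
    def step(self, action):
        gain = max(0.0, action)   # crude "muscle activation" moves us forward
        self.x += gain
        self.t += 1
        done = self.t >= 100      # fixed episode length
        return self.x, gain, done # observation, reward, done

env = ToyRunEnv()
obs, total_reward, done = env.reset(), 0.0, False
while not done:
    action = random.uniform(-1, 1)   # random policy as a placeholder agent
    obs, reward, done = env.step(action)
    total_reward += reward
print(f"distance covered: {total_reward:.2f}")
```

The leaderboard then ranks agents by the cumulative reward (distance) they achieve per episode.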

* You might wonder where the information on the roadmap promised in the title is. This is something we plan to provide here as our research work on the Opentech AI architecture proceeds and we have something interesting to share with you.