7. Japan: Frontlines of “Robo-Conversion”

For waiters and waitresses in Japan, jobs are plentiful. For every applicant there are 3.8 wait jobs to choose from, which means 2.8 wait jobs go wanting. For drivers, there are 2.7 jobs per applicant, while for builders the ratio is 3.9 to 1.
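The arithmetic behind "go wanting" can be sketched in a few lines. This is only an illustration, assuming each applicant fills exactly one opening; the dictionary keys and the helper name `unfilled_per_applicant` are hypothetical labels, while the ratios are the ones quoted above.

```python
# Jobs-per-applicant ratios quoted in the text above.
ratios = {"waiting staff": 3.8, "drivers": 2.7, "builders": 3.9}


def unfilled_per_applicant(ratio):
    """Openings left over per applicant, assuming each applicant takes one job."""
    return round(ratio - 1.0, 1)


unfilled = {job: unfilled_per_applicant(r) for job, r in ratios.items()}

for job, gap in unfilled.items():
    print(f"{job}: {gap} openings per applicant go unfilled")
```

By the same subtraction that gives 2.8 for waiting staff, drivers and builders leave 1.7 and 2.9 openings per applicant unfilled.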

That, of course, means that things don’t look as plentiful if you own a restaurant, a trucking company or want to build a shopping mall. Finding enough workers is just about impossible. A recent Financial Times headline says it all: “Japan worker shortage has only one winner so far: robots.”

6. Plagiarism and Myopia in AI

Extracted from:  https://www.facebook.com/juyang.weng/posts/10155025045069783

Hi, Fei-Fei, I am sorry that this episode takes the form of a letter to you, mainly because you are an organizer of ImageNet. Please consider this communication as a purely academic debate that addresses important issues of AI. I am happy that you are doing extremely well in terms of recognition and funding and you are also one of my friends. Please do not be personally offended by this kick-off of a scientific debate in AI, because this open discussion is very important for the future of AI.

Your AI systems are also fundamentally not scalable because they do not have their autonomous developmental programs. It is impossible to build a strong AI system without fully autonomous development.

This is a deep and complex subject of science, not a superficial and personalized issue of styles as Mr. Deng Xiaoping argued using his so-called socialism with Chinese characteristics. Unfortunately, greatly isolated during the many years of the so-called Chinese Communist Revolution, Deng Xiaoping did not gain sufficient knowledge about this deep and complex science. Likewise, many researchers in AI did not gain sufficient knowledge about the breakthroughs in the direction of Autonomous Mental Development (AMD). See Weng, McClelland, Pentland, Sporns, Stockman, Sur, and Thelen: Science, 2001 about what AMD is.

Second, let us look at a plagiarism issue in AI.

You used your work to talk about the future of AI without addressing fundamental AI issues long raised by knowledgeable scientists such as Marvin Minsky (re: neural networks are scruffy, and so are yours), Michael Jordan (re: neural networks do not abstract well, and neither do yours), and John Tsotsos (re: the exponential complexity of visual attention; your systems completely miss the function of autonomous attention).

After our brief conversation during CVPR 2010, I followed up with an email to you on June 18, 2010. In that email, I reminded you that “brittle vision systems have been well known for long” (implying that your graphical-model-based vision is brittle) and that “we now have a model for brain-mind [5] to address the deep-trenched brittleness in model-based agents (vision and beyond).” However, I did not receive any response from you.

I hereby cautiously alert you that our brain model, Developmental Networks (DN), has solved the series of fundamental problems that Marvin Minsky, Michael Jordan, and John Tsotsos raised but did not solve.

However, DN originated from Cresceptron.

Did you plagiarize Cresceptron?

Cresceptron was published by Weng, Ahuja and Huang (IJCNN 1992, ICCV 1993, and IJCV 1997), and it was the first deep-learning convolutional network for general 3-D objects in cluttered backgrounds. We argued (IJCV 1997), citing psychological studies, that the human vision system does not have perfect shift-invariance, scale-invariance, or orientation-invariance. If there were a monolithic 3-D model in the brain, such invariances would seem to be natural consequences. Therefore, no monolithic 3-D model should exist in the brain.

Specifically, Cresceptron seems to be the first incrementally growing network for (1) detection, (2) recognition, and (3) segmentation of general 3-D objects (and scene categories) from images of cluttered natural scenes.

As soon as Cresceptron was published, many researchers asked me why I thought that the brain learns 3-D objects from 2-D patches in 2-D images rather than having a monolithic 3-D model. 3-D model-based recognition was the prevailing approach until Cresceptron was published.

Namely, Cresceptron started a revolution in computer vision, which Sandy Pentland suggested I call appearance-based vision, in contrast with the then-popular aspect graph method, which is based on a 3-D model.

In your 2005 PhD thesis, did you plagiarize Cresceptron for your entire approach?

In Section 9.2, literature review, you cited only some feature extraction techniques that assume 3-D objects (e.g., shape from shading, Hough transform, scale-invariant features, and affine-invariant features). In Section 9.3, Contribution, you stated “to learn a completely new object category given very few training examples,” which is exactly what Cresceptron did 12 years before your thesis. The “one-shot” recognition idea in your thesis is also what Cresceptron did. Your Fig. 11.3 is the same, in its ideas, as Fig. 1 in Cresceptron ICCV 1993. You wrote about “incrementally” in the 2nd paragraph on page 79, but this is also what Cresceptron did.

ICCV, where Cresceptron was published 12 years before your PhD thesis, was arguably the most visible computer vision conference at the time. The complete journal version of Cresceptron was published in IJCV, arguably the most visible computer vision journal, 8 years before your PhD thesis. However, your entire PhD thesis did not cite Cresceptron.

Your later papers also used other ideas of Cresceptron, such as the AND and OR structures (see Fig. 3, grouplets, of the CVPR 2010 paper and so on). Cresceptron automatically generated such structures but you only handcrafted them. Yet you still did not cite Cresceptron.

In many of your papers and talks, you cited only the deep convolutional networks of Kunihiko Fukushima, Geoffrey E. Hinton and Yann LeCun for deep learning.

It is well known that Kunihiko Fukushima used a deep convolutional network (Neocognitron) only for the classification of isolated, single numerals, which is fundamentally 2-D, not 3-D, and does not involve cluttered natural scenes. Until how recently did Kunihiko Fukushima, Geoffrey E. Hinton and Yann LeCun use their networks exclusively for the classification of isolated, single numerals, which is fundamentally 2-D?

Also, by turning a blind eye to the brain’s developmental program, the ImageNet Contest that you initiated is not only myopic but also misleading to the AI field. The ImageNet competitions lack the rigorous protocols that any scientific and engineering problem must clearly specify:

1. What is known (e.g., a particular set of static symbolic labels)?
2. What is given (e.g., all training and testing images along with the solutions)?
3. What is assumed (e.g., episodic image classification only and attention-free only)?
4. What is not allowed in producing the results (e.g., can a team design methods after looking at test data, that is, learn after seeing the exam)?
5. How long do the interactive manual developmental processes last (e.g., years across competitions)?

Were each team’s training and testing conducted under the organizers’ due supervision? The following human interventions could be critical:

1. test-specific symbolic representations,
2. task-specific algorithms,
3. test-specific algorithms,
4. parameter tweaking,
5. handpicked training and testing data (which you called “clean”, but does a child require mom to clean the visual world on his retinas before he can learn?),
6. expansions of computing resources after seeing the test performance,
7. human handpicking of only the single best result, out of many, to report.

Is this a human algorithm assisted by computers? This is not the autonomous development that AMD algorithms were meant for, where every network must not only be successful, like every human child, but must also be optimal in the sense of maximum likelihood, conditioned on the learning experience of each child (network).

You stated only “computer telling us”, but in fact it is humans telling us, with computers as dumb tools. Those computers in your ImageNet do not truly “see” in the sense of human vision.

For example, the “dual task paradigm” used in your PhD thesis cannot totally eliminate the effect of attention. Many psychologists do not know that. All you need is to understand how top-down attention works and learns in the DN network. Dual-attention is an anytime learned brain behavior.

It is also difficult for a human to label every image consistently, because the label depends on what pops up in the person’s mind when he or she looks at the scene under the dual task paradigm. Human attention could be very rich, depending on the individual’s life experience and laboratory instructions.

Was every ImageNet team’s computer program under the dual task paradigm?

You compared the 3.6% machine error rate in 2015 with 5.1% from humans. Is it a fair comparison or a misleading comparison?

Was it honest for you to state “machines’ progress has basically reached, and sometimes surpassed human level” (GIF’17)?

Let us further examine the fact that many humans in each team were in the iterative loop of your contest. Suppose that you have enough money to buy the total competition time of all students in Stanford University to participate in your ImageNet Stanford team, but you report only the student who happened to get the highest score. You then claim, “See, Stanford education surpassed the level of other humans!”

You cannot afford to sell the entire Stanford team to every human user for a computer vision application. This is why every human brain learns autonomously within a closed skull. Each brain must learn successfully without buying the entire Stanford team into its skull to design representations and tweak parameters.
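The selection effect in this analogy (reporting only the highest of many scores) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: every participant has the same true accuracy and scores differ only by chance. The function names and the values 0.90, 500, 20, and 100 are hypothetical and are not taken from any ImageNet result.

```python
import random
import statistics


def best_of_n(n_attempts, rng, true_accuracy=0.90, n_test=500):
    """Highest test accuracy among n_attempts equally skilled entries.

    Assumption: every entry answers each test item correctly with the same
    probability, so differences between entries are pure chance.
    """
    scores = []
    for _ in range(n_attempts):
        correct = sum(rng.random() < true_accuracy for _ in range(n_test))
        scores.append(correct / n_test)
    return max(scores)


def mean_reported(n_attempts, trials=100, seed=0):
    """Average reported score when only the best of n_attempts is published."""
    rng = random.Random(seed)
    return statistics.mean(best_of_n(n_attempts, rng) for _ in range(trials))


single = mean_reported(1)       # one entry, reported as is
best_of_20 = mean_reported(20)  # twenty entries, only the best reported

print(f"one entry:   {single:.3f}")
print(f"best of 20:  {best_of_20:.3f}")
```

Under these assumptions the best-of-20 figure comes out above the single-entry figure even though no entry is actually more accurate than another, which is the point of the analogy: the reported maximum measures how many attempts were made as well as how good each one is.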

I predict that over the years from 2010 to 2015, more and more teams interactively figured out ways to beat the ImageNet competitions that you listed in GIF’17. For example, some data and symbolic representations were re-used over the years, but you do not care how those data are used. Your ImageNet competition seems to be a race of both human resources and machine resources, where teams could try to take full advantage of the task limitations of the ImageNet tests. The ImageNet competitions seem unscientific, non-developmental, myopic, and not general-purpose. The contests were designed to bypass the major fundamental bottleneck problems in AI.

The AIML Contest series was meant to overcome major limitations of the ImageNet Contests and other contests. It is meant for AMD, where a single learning engine must autonomously develop for any practical sensing modality and motor modality, including vision, audition, and natural language acquisition (text). It is meant for any practical task that a human teacher in the environment has in mind; however, the task to be learned is not known to the programmer of the learning engine. I invite you to take a look at the Rules of the AIML Contest, which was successfully conducted during 2016 without getting a penny of government funding. The AIML Contest will be conducted again in 2017.

Jan. 23, 2017

John Weng, PhD

Professor, Michigan State University, USA