Google Books NGram | Voters | SerpApi, LLC - Old roadmap, we've just migrated our public roadmap to our Github `serpapi`

Google Books NGram

planned

Bill Frischling

The goal we're trying to hit: when did Google first index a term?
The before: and after: operators don't work for this: if a page was indexed in 2000, it'll show up for, e.g., "COVID-19" even though the term didn't exist in 2000 (only the page did).
We like this proxy:
https://books.google.com/ngrams
Useful, but of course the data is obfuscated in an SVG... would it be possible to extract it from there (or get it another way)?

February 2, 2022

Emirhan Akdeniz

Pinned

The issue has been moved to: https://github.com/serpapi/public-roadmap/issues/51

Illia Zub

marked this post as

planned

Illia Zub

Bill Frischling, Google Books Ngram Viewer has a JSON endpoint (https://books.google.com/ngrams/json). It accepts the same parameters as the viewer page and responds with an array of objects.

curl -s --compressed 'https://books.google.com/ngrams/json?content=Albert+Einstein%2CSherlock+Holmes%2CFrankenstein&year_start=1800&year_end=2022' | jq '.[] | keys'
[
  "ngram",
  "parent",
  "timeseries",
  "type"
]
[
  "ngram",
  "parent",
  "timeseries",
  "type"
]
[
  "ngram",
  "parent",
  "timeseries",
  "type"
]
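For anyone who prefers Python over curl and jq, here is a minimal sketch of the same request. It assumes the response shape shown above (an array of objects with `ngram` and `timeseries` keys) and, as a further assumption, that `timeseries` holds one value per year starting at `year_start`. The endpoint is undocumented, so either assumption may stop holding. The helper names `build_url`, `to_year_series`, and `fetch` are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

NGRAMS_JSON = "https://books.google.com/ngrams/json"


def build_url(terms, year_start=1800, year_end=2022):
    """Build the query URL; terms are comma-joined as in the curl call."""
    query = urlencode({
        "content": ",".join(terms),
        "year_start": year_start,
        "year_end": year_end,
    })
    return f"{NGRAMS_JSON}?{query}"


def to_year_series(entry, year_start):
    """Pair each timeseries value with a year.

    Assumption: one value per year, starting at year_start.
    """
    return {year_start + i: v for i, v in enumerate(entry["timeseries"])}


def fetch(terms, year_start=1800, year_end=2022):
    """Fetch and decode the JSON array (performs a network request)."""
    req = Request(build_url(terms, year_start, year_end),
                  headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(req, timeout=30) as resp:
        return json.load(resp)
```

Calling `fetch(["Albert Einstein", "Frankenstein"])` and passing each returned object to `to_year_series(entry, 1800)` should give a year-to-frequency mapping per term, as long as the assumptions above hold.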

Related research:

Bill, thank you for this feature request! We'll update this thread when we support Google Books Ngrams. Until then, you can use Google's undocumented API. Make sure you avoid getting blocked by Google.

Bill Frischling

Illia Zub: Love it. Thanks so much!

Justin O'Hara

Hi Bill Frischling, I inspected the HTML for https://books.google.com/ngrams and the element for one of the search terms.
``<text class="label hover clickable" aria-hidden="true" transform="translate(397.1,45)" x="3" dy=".1em" style="font-size: 15px; fill: rgb(211, 47, 47); opacity: 0.12; font-weight: normal;">Sherlock Holmes</text>``
What measurable or static data do you want from the HTML that we could potentially scrape? If it's not in the HTML, we won't be able to scrape it.

Bill Frischling

Justin O'Hara: Understood. I'm still poking, and I was hoping the year and % could be extrapolated in some way, but it appears to be quite thoroughly obfuscated unless I'm reading it wrong. The mouseover data is what we are going for, but darned if I can figure out how to translate that from the SVG.
We are looking at a couple of code blocks we found that can translate the chart area and SVG points into a relative measurement (e.g. https://stackoverflow.com/questions/43727621/converting-svg-from-highcharts-data-into-data-points) just to see if it can be done (more on the 'damn you Google, we'll prove we can beat the obfuscation' than for any practical use on our end), but it def wouldn't be a straightforward extract from embedded attributes or JSON.
I was hoping I missed something in the code that might have expressly stated "1969" and "0.0000371656" to extract, but sounds like that's not the case.
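For reference, the chart-area-to-data-point idea boils down to linear interpolation between the plot box and the axis ranges. The sketch below is only an illustration under loud assumptions: the path uses nothing but absolute M/L commands with plain decimal numbers, no transforms apply, and the plot box and axis ranges are already known. The real Ngram Viewer SVG may not satisfy any of these. `parse_path_points` and `pixels_to_data` are hypothetical names.

```python
import re


def parse_path_points(d):
    """Extract (x, y) pixel pairs from a path made of absolute M/L commands.

    Assumption: plain decimal numbers only (no exponents, no curves).
    """
    nums = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", d)]
    return list(zip(nums[0::2], nums[1::2]))


def pixels_to_data(points, box, x_range, y_range):
    """Linearly map pixel coordinates to data coordinates.

    box = (left, top, width, height) of the plot area in pixels.
    SVG y grows downward, so the top edge maps to y_range[1].
    """
    left, top, width, height = box
    x0, x1 = x_range
    y0, y1 = y_range
    out = []
    for px, py in points:
        x = x0 + (px - left) / width * (x1 - x0)
        y = y1 - (py - top) / height * (y1 - y0)
        out.append((x, y))
    return out
```

With a 100x100 plot box, years 1800 to 2000 on the x axis, and 0 to 1 on the y axis, the pixel point (50, 50) maps to (1900.0, 0.5).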

Justin O'Hara

marked this post as

under review

Ali

Hello Bill,
I hope you are doing well. If you can't do what you're looking to do with Google, I don't think you can do it with SerpApi. We support search operators, but I see that you have already tested them.
For the second part, are you requesting this Google Books Ngrams page as a new API?

Bill Frischling

Ali: Yes... even just to pull basic data on term distribution across dates. As part of our algos, we use proxies to try to figure out when a term first came into circulation in common language usage. Trends is great for that, but obvi limited in how far back it goes, to the 1990s (we like that feature request of course). Books NGrams rolls back to 1800, which for our purposes is just AWESOME.
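One way to turn a year-to-frequency series into a "first came into circulation" estimate is to look for the first year that starts a sustained run above some floor, so that a single noisy year does not count. This is a sketch, not anyone's actual algorithm: the function name `first_sustained_year`, the `threshold` value, and the `min_run` length are all assumptions to tune.

```python
def first_sustained_year(series, threshold=1e-9, min_run=3):
    """Return the first year that starts a run of min_run consecutive years
    with frequency at or above threshold, or None if there is no such run.

    series maps year -> relative frequency.
    """
    run_start = None
    run_len = 0
    prev_year = None
    for year in sorted(series):
        consecutive = prev_year is not None and year == prev_year + 1
        if series[year] >= threshold:
            if run_len == 0 or not consecutive:
                # Start a new run (also after a gap in the years).
                run_start, run_len = year, 1
            else:
                run_len += 1
            if run_len >= min_run:
                return run_start
        else:
            run_len = 0
        prev_year = year
    return None
```

Setting `min_run=1` falls back to "first year at or above the threshold", which is more sensitive to OCR noise and misdated books in the corpus.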