Google Books NGram
planned
Bill Frischling
The goal we're trying to hit: when did Google first index a term?
The before: and after: operators don't work: if a page was indexed in 2000, it'll show up for, e.g., "COVID-19" even though the term didn't appear in 2000 (but the page did).
We like this proxy:
Useful, but of course the data is obfuscated in an SVG... if it's possible (or another way)...?
Emirhan Akdeniz
The issue has been moved to: https://github.com/serpapi/public-roadmap/issues/51
Illia Zub
planned
Illia Zub
Bill Frischling, Google Books Ngram Viewer has a JSON endpoint: https://books.google.com/ngrams/json. It accepts the same parameters and responds with an array of objects.
curl -s --compressed 'https://books.google.com/ngrams/json?content=Albert+Einstein%2CSherlock+Holmes%2CFrankenstein&year_start=1800&year_end=2022' | jq '.[] | keys'
[
"ngram",
"parent",
"timeseries",
"type"
]
[
"ngram",
"parent",
"timeseries",
"type"
]
[
"ngram",
"parent",
"timeseries",
"type"
]
Related research:
Bill, thank you for this feature request! We'll update this thread when we support Google Books Ngrams. Until then, you can use Google's undocumented API. Make sure you avoid getting blocked by Google.
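For the original goal (when did a term first show up?), the endpoint above could be queried along these lines. This is only a minimal sketch: it assumes each object's `timeseries` is a list with one frequency per year starting at `year_start`, that the endpoint honours the viewer's `smoothing` parameter, and that a browser-like User-Agent helps. None of that is documented by Google, and the function names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Undocumented endpoint mentioned in this thread; it may change or block requests.
NGRAMS_JSON_URL = "https://books.google.com/ngrams/json"


def fetch_ngrams(terms, year_start=1800, year_end=2022):
    """Fetch the raw JSON array for a list of terms (makes a network request)."""
    query = urllib.parse.urlencode({
        "content": ",".join(terms),
        "year_start": year_start,
        "year_end": year_end,
        "smoothing": 0,  # assumption: same meaning as in the viewer URL
    })
    request = urllib.request.Request(
        f"{NGRAMS_JSON_URL}?{query}",
        headers={"User-Agent": "Mozilla/5.0"},  # assumption: browser-like UA
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def first_year_in_use(timeseries, year_start, threshold=0.0):
    """Return the first year whose frequency exceeds threshold, or None."""
    # Assumption: index 0 of timeseries corresponds to year_start.
    for offset, frequency in enumerate(timeseries):
        if frequency > threshold:
            return year_start + offset
    return None


# Example usage (not run here because it needs network access):
# for series in fetch_ngrams(["COVID-19"], year_start=1800):
#     print(series["ngram"], first_year_in_use(series["timeseries"], 1800))
```

A small non-zero `threshold` may be worth trying, since OCR errors and misdated books can produce stray hits long before a term was really in circulation.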
Bill Frischling
Illia Zub: Love it. Thanks so much!
Justin O'Hara
Hi Bill Frischling I inspected the HTML for https://books.google.com/ngrams and the element for one of the search items.
```
<text class="label hover clickable" aria-hidden="true" transform="translate(397.1,45)" x="3" dy=".1em" style="font-size: 15px; fill: rgb(211, 47, 47); opacity: 0.12; font-weight: normal;">Sherlock Holmes</text>
```
What measurable or static data do you want from the HTML that we could potentially scrape? If it's not in the HTML, we won't be able to scrape it.
Bill Frischling
Justin O'Hara: Understood. I'm still poking and I was hoping the year and % could be extrapolated in some way, but it appears to be quite thoroughly obfuscated unless I'm reading it wrong. The mouseover data is what we are going for, but darned if I can figure how to translate that from the SVG.
We are looking at a couple of code blocks we found that can translate the chart area and SVG points into a relative measurement (e.g. https://stackoverflow.com/questions/43727621/converting-svg-from-highcharts-data-into-data-points) just to see if it can be done (more on the 'damn you Google, we'll prove we can beat the obfuscation' than for any practical use on our end), but it def wouldn't be a straightforward extract from embedded attributes or JSON.
I was hoping I missed something in the code that might have expressly stated "1969" and "0.0000371656" to extract, but sounds like that's not the case.
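For anyone curious, the SVG-points-to-values idea boils down to linear rescaling between two known reference points per axis. Here is a rough sketch of that technique only. It assumes the line is a simple path made of absolute `M`/`L` commands with plain decimal coordinates, and that you can read two pixel positions with known values off each axis (tick labels, for instance). The path string and calibration numbers in the usage note are hypothetical, not taken from the Ngram Viewer.

```python
import re


def parse_path_points(d):
    """Extract (x, y) pixel pairs from a simple 'M x,y L x,y ...' path string."""
    # Assumption: absolute M/L commands only, no curves, no exponent notation.
    numbers = [float(n) for n in re.findall(r"-?\d+(?:\.\d+)?", d)]
    return list(zip(numbers[0::2], numbers[1::2]))


def make_scale(pixel_a, value_a, pixel_b, value_b):
    """Build a linear pixel-to-value mapping from two known reference points."""
    slope = (value_b - value_a) / (pixel_b - pixel_a)
    return lambda pixel: value_a + (pixel - pixel_a) * slope


def path_to_series(d, x_scale, y_scale):
    """Convert path pixels to (year, frequency) pairs using the two scales."""
    return [(round(x_scale(x)), y_scale(y)) for x, y in parse_path_points(d)]
```

With hypothetical calibration such as `x_scale = make_scale(0, 1800, 100, 2000)` and `y_scale = make_scale(100, 0.0, 0, 0.0001)` (SVG y grows downward, so the larger pixel maps to the smaller value), `path_to_series("M0,100L50,50L100,0", x_scale, y_scale)` yields one (year, frequency) pair per point. Accuracy is limited by pixel resolution, so this only approximates the mouseover values.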
Justin O'Hara
under review
Ali
Hello Bill,
I hope you are doing well. If you can't do what you are looking to do with Google, I don't think you can do it with SerpApi. We support the operators, but I see that you have already tested them.
For the second part, are you requesting the Google Books Ngrams page as a new API?
Bill Frischling
Ali: Yes... even just to pull basic data on term distribution across date. As part of our algos, we use proxies to try to figure out when a term first came into circulation in common language usage. Trends is great for that, but obvi limited to the time (we like that feature request of course) back to the 1990s. Books NGrams rolls back to 1800, which for our purposes is just AWESOME